<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1014"> <Title>Training Tree Transducers</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Regular Tree Grammars </SectionTitle> <Paragraph position="0"> In this section, we describe the regular tree grammar, a common way of compactly representing a potentially infinite set of trees (similar to the role played by the finite-state acceptor FSA for strings). We describe the version (equivalent to TSG (Schabes, 1990)) in which the generated trees are given weights, just as strings are in a WFSA.</Paragraph> <Paragraph position="1"> A weighted regular tree grammar (wRTG) G is a quadruple (Σ, N, S, P), where Σ is the alphabet, N is the finite set of nonterminals, S ∈ N is the start (or initial) nonterminal, and P ⊆ N × T_Σ(N) × ℝ⁺ is the finite set of weighted productions (ℝ⁺ ≡ {r ∈ ℝ | r > 0}). A production (lhs, rhs, w) is written lhs →_w rhs. Productions whose rhs contains no nonterminals (rhs ∈ T_Σ) are called terminal productions, and rules of the form A →_w B, for A, B ∈ N, are called ε-productions (epsilon productions); they can be used in lieu of multiple initial nonterminals.</Paragraph> <Paragraph position="2"> Figure 2 shows a sample wRTG. This grammar accepts an infinite number of trees. The tree S(NP(DT(the), N(sons)), VP(V(run))) comes out with probability 0.3.</Paragraph> <Paragraph position="3"> We define the binary relation ⇒_G (single-step derives in G) on T_Σ(N) × (paths × P)*, pairs of trees and derivation histories, which are logs of the (location, production) used: (a, h) ⇒_G (b, h·(p, (l, r, w))) iff tree b may be derived from tree a by using the rule l →_w r to replace the nonterminal leaf l at path p with r. For a derivation history h = ((p_1, (l_1, r_1, w_1)), ..., (p_n, (l_n, r_n, w_n))), the weight of h is w(h) ≡ ∏_{i=1}^{n} w_i, and we call h leftmost if each step rewrites the leftmost (lexicographically least path) nonterminal leaf of the current tree.</Paragraph> <Paragraph position="4"> The reflexive, transitive closure of ⇒_G is written ⇒*_G (derives in G), and the restriction of ⇒*_G to leftmost derivation histories is ⇒L*_G (leftmost derives in G).</Paragraph> <Paragraph position="5"> The weight of a becoming b in G is w_G(a, b) ≡ Σ_{h : (a, ()) ⇒L*_G (b, h)} w(h), the sum of weights of all unique (leftmost) derivations transforming a to b, and the weight of t in G is W_G(t) ≡ w_G(S, t). The weighted regular tree language produced by G is L_G ≡ {(t, w) ∈ T_Σ × ℝ⁺ | W_G(t) = w}.</Paragraph> <Paragraph position="6"> For every weighted context-free grammar, there is an equivalent wRTG that produces its weighted derivation trees, whose yields are the strings produced, and the yields of regular tree grammars are context-free string languages (Gecseg and Steinby, 1984).</Paragraph> <Paragraph position="7"> What is sometimes called a forest in natural language generation (Langkilde, 2000; Nederhof and Satta, 2002) is a finite wRTG without loops, i.e., no nonterminal n ∈ N can derive a tree in which n itself reappears. The tree sets of regular tree grammars are strictly contained in the tree sets of tree adjoining grammars (Joshi and Schabes, 1997).</Paragraph>
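<Paragraph position="8"> A minimal sketch of these definitions, assuming trees are encoded as ("label", child, ...) tuples, nonterminal leaves as plain strings, and a small hypothetical grammar standing in for Figure 2 (the nonterminal names and weights below are illustrative, not the paper's):

from collections import defaultdict

NONTERMS = {"qS", "qNP", "qVP"}           # N (hypothetical)
START = "qS"                              # S
PRODUCTIONS = [                           # P: (lhs, rhs, weight), all hypothetical
    ("qS",  ("S", "qNP", "qVP"), 1.0),
    ("qNP", ("NP", ("DT", ("the",)), ("N", ("sons",))), 0.6),
    ("qNP", ("NP", ("DT", ("the",)), ("N", ("daughters",))), 0.4),
    ("qVP", ("VP", ("V", ("run",))), 0.5),
    ("qVP", ("VP", ("V", ("smile",))), 0.5),
]

def leftmost_nonterminal(tree, path=()):
    """Path and label of the leftmost nonterminal leaf, or None if the tree is pure."""
    if isinstance(tree, str):
        return (path, tree) if tree in NONTERMS else None
    for i, child in enumerate(tree[1:]):
        hit = leftmost_nonterminal(child, path + (i,))
        if hit:
            return hit
    return None

def replace_at(tree, path, sub):
    """Return tree with the subtree at path replaced by sub."""
    if not path:
        return sub
    kids = list(tree[1:])
    kids[path[0]] = replace_at(kids[path[0]], path[1:], sub)
    return (tree[0],) + tuple(kids)

def complete_leftmost_derivations(tree, weight=1.0):
    """Yield (derived tree, derivation weight) for every complete leftmost derivation."""
    hit = leftmost_nonterminal(tree)
    if hit is None:                       # no nonterminals remain: a tree of L_G
        yield tree, weight
        return
    path, nt = hit
    for lhs, rhs, w in PRODUCTIONS:
        if lhs == nt:
            yield from complete_leftmost_derivations(replace_at(tree, path, rhs), weight * w)

# W_G(t) is the sum of weights of all leftmost derivations of t from START.
W = defaultdict(float)
for t, w in complete_leftmost_derivations(START):
    W[t] += w
print(W[("S", ("NP", ("DT", ("the",)), ("N", ("sons",))), ("VP", ("V", ("run",))))])
# 0.3 (up to float rounding)
</Paragraph>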
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Extended-LHS Tree Transducers (xR) </SectionTitle> <Paragraph position="0"> Section 1 informally described the root-to-frontier transducer class R. We saw that R allows, by use of states, finite lookahead and arbitrary rearrangement of non-sibling input subtrees removed by a finite distance. However, it is often easier to write rules that explicitly represent such lookahead and movement, relieving the user of the burden of producing the requisite intermediary rules and states. We define xR, a convenience-oriented generalization of weighted R. Because of its good fit to natural language problems, xR is already briefly touched on, though not defined, in (Rounds, 1970).</Paragraph> <Paragraph position="1"> A weighted extended-lhs root-to-frontier tree transducer X is a quintuple (Σ, Δ, Q, Q_i, R) where Σ is the input alphabet, Δ is the output alphabet, Q is a finite set of states, Q_i ∈ Q is the initial (or start, or root) state, and R ⊆ Q × XRPAT_Σ × T_Δ(Q × paths) × ℝ⁺ is a finite set of weighted transformation rules, written (q, pattern) →_w rhs, meaning that an input subtree matching pattern while in state q is transformed into rhs, with its Q × paths leaves replaced by their (recursive) transformations. The Q × paths leaves of an rhs are called nonterminals (there may also be terminal leaves labeled by the output alphabet Δ).</Paragraph> <Paragraph position="2"> XRPAT_Σ is the set of finite tree patterns: predicate functions f : T_Σ → {0, 1} that depend only on the label and rank of a finite number of fixed paths of their input. xR is the set of all such transducers. R, the set of conventional top-down transducers, is the subset of xR whose rules are restricted to finite tree patterns that depend only on the root: RPAT_Σ ≡ {p_{σ,r}}, where p_{σ,r}(t) = 1 iff the root of t has label σ and rank r.</Paragraph> <Paragraph position="3"> Rules whose rhs is a pure T_Δ tree, with no states/paths for further expansion, are called terminal rules. Rules of the form (q, pat) →_w (q', ()) are ε-rules (epsilon rules), which substitute state q' for state q without producing output and stay at the current input subtree. Multiple initial states are not needed: we can use a single start state Q_i and, instead of each initial state q with starting weight w, add the rule (Q_i, TRUE) →_w (q, ()) (where TRUE(t) ≡ 1 for all t).</Paragraph> <Paragraph position="4"> We define the binary relation ⇒_X for an xR transducer X on T_{Σ∪Δ∪Q} × (paths × R)*, pairs of partially transformed (working) trees and derivation histories. Informally, b is derived from a by applying a rule (q, pat) →_w r to an unprocessed input subtree a|_i that is in state q, replacing it by the output given by r, with each nonterminal of r replaced by the instruction to transform the descendant input subtree at relative path i' in state q'.</Paragraph> <Paragraph position="5"> The sources of a rule r = (q, l, rhs, w) ∈ R are the input-path parts of the rhs nonterminals: sources(rhs) ≡ {i' | (q', i') is a nonterminal leaf of rhs}. If the sources of a rule refer to input paths that do not exist in the input, then the rule cannot apply (because a|_{i·(1)·i'} would not exist). In the traditional statement of R, sources(rhs) is always {(1), ..., (n)}, writing x_i instead of (i), but in xR we identify mapped input subtrees by arbitrary (finite) paths.</Paragraph> <Paragraph position="6"> An input tree is transformed by starting at the root in the initial state and recursively applying output-generating rules to a frontier of (copies of) input subtrees, each marked with its own state, until (in a complete derivation, finishing at the leaves with terminal rules) no states remain.</Paragraph> <Paragraph position="7"> Let ⇒*_X, ⇒L*_X, and w_X(a, b) follow from ⇒_X exactly as in Section 3. Then the weight of (i, o) in X is W_X(i, o) ≡ w_X(Q_i(i), o). The weighted tree transduction given by X is X_X ≡ {(i, o, w) ∈ T_Σ × T_Δ × ℝ⁺ | W_X(i, o) = w}.</Paragraph>
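<Paragraph position="8"> A minimal sketch of one ⇒_X step, assuming trees are ("label", child, ...) tuples, a rule is a (state, pattern predicate, rhs, weight) record whose rhs nonterminals are ("NT", state, relative-path) leaves, and pending work is marked as ("@", state, input-subtree); the example rule and all names are illustrative:

# Illustrative xR machinery (not the paper's data structures).
def subtree_at(tree, path):
    for i in path:
        tree = tree[1 + i]
    return tree

def make_rule(state, pattern, rhs, weight):
    """pattern: predicate on the input subtree (may inspect labels/ranks at fixed paths).
    rhs: output tree whose nonterminal leaves are ("NT", state, relative_path)."""
    return {"q": state, "pat": pattern, "rhs": rhs, "w": weight}

def apply_rule(rule, state, input_subtree):
    """One step at a frontier node: returns (output fragment, weight) or None."""
    if state != rule["q"] or not rule["pat"](input_subtree):
        return None
    def expand(rhs):
        if isinstance(rhs, tuple) and rhs and rhs[0] == "NT":
            _, q2, rel_path = rhs
            return ("@", q2, subtree_at(input_subtree, rel_path))  # new pending subtree
        if isinstance(rhs, tuple):
            return (rhs[0],) + tuple(expand(c) for c in rhs[1:])
        return rhs
    return expand(rule["rhs"]), rule["w"]

# Hypothetical extended-lhs rule: in state "q", match S(NP(...), VP(...)) by looking one
# level below the root (conventional R could only test the root), and swap the subtrees.
pattern = lambda t: t[0] == "S" and len(t) == 3 and t[1][0] == "NP" and t[2][0] == "VP"
rule = make_rule("q", pattern, ("S", ("NT", "q", (1,)), ("NT", "q", (0,))), 0.5)

tree = ("S", ("NP", ("sons",)), ("VP", ("run",)))
print(apply_rule(rule, "q", tree))
# (('S', ('@', 'q', ('VP', ('run',))), ('@', 'q', ('NP', ('sons',)))), 0.5)
</Paragraph>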
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Parsing a Tree Transduction </SectionTitle> <Paragraph position="0"> Derivation trees for a transducer X = (Σ, Δ, Q, Q_i, R) are trees labeled by rules (R) that dictate the choice of rules in a complete X-derivation. Figure 3 shows derivation trees for a particular transducer. In order to generate derivation trees for X automatically, we build a modified transducer X'. This new transducer produces derivation trees on its output instead of normal output trees: the original rhs of each rule is flattened into a tree of depth 1, with the root labeled by the original rule and all the non-expanding Δ-labeled nodes of the rhs removed, so that the remaining children are the nonterminal yield in left-to-right order. Derivation trees deterministically produce a single weighted output tree.</Paragraph> <Paragraph position="1"> The derived transducer X' nicely produces derivation trees for a given input, but in explaining an observed (input/output) pair, we must restrict the possibilities further. Because the transformations of an input subtree depend only on that subtree and its state, we can (Algorithm 1) build a compact wRTG that produces exactly the weighted derivation trees corresponding to X-transductions (I, ()) ⇒*_X (O, h), with weight equal to w_X(h).</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Inside-Outside for wRTG </SectionTitle> <Paragraph position="0"> Given a wRTG G = (Σ, N, S, P), we can compute the sums of weights of trees derived using each production by adapting the well-known inside-outside algorithm for weighted context-free (string) grammars (Lari and Young, 1990).</Paragraph> <Paragraph position="1"> The inside weights using G are given by β_G : T_Σ(N) → ℝ, the sum of weights of all tree-producing derivations from trees with nonterminal leaves: β_G(n) ≡ Σ_{(n, r, w) ∈ P} w · β_G(r) for a nonterminal n ∈ N, and β_G(σ(t_1, ..., t_k)) ≡ ∏_{i=1}^{k} β_G(t_i) for a tree whose root σ ∈ Σ.</Paragraph> <Paragraph position="2"> By definition, β_G(S) gives the sum of the weights of all trees generated by G. For the wRTG generated by DERIV(X, I, O), this is exactly W_X(I, O).</Paragraph> <Paragraph position="3"> Outside weights α_G for a nonterminal are the sums of weights of trees generated by the wRTG that have derivations containing it, but excluding its inside weights (that is, the weights summed do not include the weights of rules used to expand an instance of it): α_G(n) ≡ 1 if n = S; otherwise α_G(n) sums, over all uses of n in productions (that is, over productions (n', r, w) ∈ P and occurrences of n in r), the quantity w · α_G(n') · ∏ β_G(n''), the product taken over the other nonterminal leaves n'' of r.</Paragraph> <Paragraph position="4"> Algorithm 1 (DERIV) takes as input an xR transducer X = (Σ, Δ, Q, Q_i, R) and an observed tree pair I ∈ T_Σ, O ∈ T_Δ, and outputs a derivation wRTG G = (R, N ⊆ Q × paths_I × paths_O, S, P) generating all weighted derivation trees for X that produce O from I, or returns false if there are no such trees. It is built around a memoized routine PRODUCE_{I,O}(q, i, o): the possible derivations for a given PRODUCE_{I,O}(q, i, o) are constant and need not be computed more than once. In the worst case we visit all |Q|·|I|·|O| (q, i, o) triples and have all |R| transducer rules match at each of them. If enumerating the rules matching transducer input patterns and output subtrees has cost L (constant for a given transducer), then DERIV has time complexity O(L·|Q|·|I|·|O|·|R|).</Paragraph>
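<Paragraph position="5"> A minimal sketch of these recursions for a loop-free wRTG, reusing the tuple encoding sketched earlier with a small hypothetical grammar; beta is memoized, and alpha is accumulated in an order where every user of a nonterminal is processed before it:

from collections import defaultdict
from functools import lru_cache

# Hypothetical loop-free wRTG; rhs trees are ("label", child, ...) tuples and
# nonterminal leaves are plain strings drawn from NONTERMS.
NONTERMS = {"S", "A", "B"}
START = "S"
P = [
    ("S", ("f", "A", "B"), 1.0),
    ("A", ("a",), 0.7),
    ("A", ("a2",), 0.3),
    ("B", ("b",), 0.4),
    ("B", ("f", "A", ("b",)), 0.6),
]

def nt_leaves(t, path=()):
    """(path, nonterminal) for every nonterminal leaf of an rhs tree t."""
    if isinstance(t, str):
        return [(path, t)] if t in NONTERMS else []
    hits = []
    for i, c in enumerate(t[1:]):
        hits.extend(nt_leaves(c, path + (i,)))
    return hits

@lru_cache(maxsize=None)
def beta(t):
    """Inside weight: total weight of derivations turning t into a pure tree."""
    if isinstance(t, str):                     # a nonterminal: sum over its productions
        return sum(w * beta(r) for lhs, r, w in P if lhs == t)
    total = 1.0                                # a labeled node: product over children
    for c in t[1:]:
        total *= beta(c)
    return total

def topo_order():
    """Nonterminals ordered so every user of n comes before n (grammar is loop-free)."""
    uses = {n: set() for n in NONTERMS}
    for lhs, r, w in P:
        uses[lhs].update(n for _, n in nt_leaves(r))
    seen, post = set(), []
    def visit(n):
        if n not in seen:
            seen.add(n)
            for m in uses[n]:
                visit(m)
            post.append(n)
    visit(START)
    return list(reversed(post))

def alphas():
    """Outside weights alpha_G(n) for a loop-free wRTG."""
    alpha = defaultdict(float)
    alpha[START] = 1.0
    for n_user in topo_order():
        for lhs, r, w in P:
            if lhs == n_user:
                leaves = nt_leaves(r)
                for path, n in leaves:
                    rest = 1.0
                    for p2, n2 in leaves:      # inside weights of the sibling nonterminals
                        if p2 != path:
                            rest *= beta(n2)
                    alpha[n] += alpha[n_user] * w * rest
    return alpha

print(beta(START))      # 1.0: total weight of all trees generated by this grammar
print(dict(alphas()))   # {'S': 1.0, 'A': 1.6, 'B': 1.0}
</Paragraph>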
<Paragraph position="6"> Finally, given inside and outside weights, the sum of weights of trees using a particular production is γ_G((n, r, w) ∈ P) ≡ α_G(n) · w · β_G(r).</Paragraph> <Paragraph position="7"> Computing α_G and β_G for a nonrecursive wRTG is a straightforward translation of the above recursive definitions (using memoization to compute each result only once) and is O(|G|) in time and space.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 EM Training </SectionTitle> <Paragraph position="0"> Expectation-Maximization training (Dempster, Laird, and Rubin, 1977) works on the principle that the corpus likelihood can be maximized, subject to some normalization constraint on the parameters, by repeatedly (1) estimating the expectation of the decisions taken for all possible ways of generating the training corpus given the current parameters, accumulating parameter counts, and (2) maximizing by assigning the counts to the parameters and renormalizing. Each iteration is guaranteed to increase the likelihood until a local maximum is reached.</Paragraph> <Paragraph position="1"> Algorithm 2 implements EM xR training, repeatedly computing inside-outside weights (using fixed transducer derivation wRTGs for each input/output tree pair) to efficiently sum each parameter's contribution to the likelihood over all derivations. Each EM iteration takes time linear in the size of the transducer and linear in the size of the derivation tree grammars for the training examples. The size of the derivation tree grammars is at worst O(|Q|·|I|·|O|·|R|).</Paragraph> <Paragraph position="2"> For a corpus of K examples with average input/output size M, an iteration takes (at worst) O(|Q|·|R|·K·M^2) time--quadratic, like the forward-backward algorithm.</Paragraph> <Paragraph position="3"> Algorithm 2 takes as input an xR transducer X = (Σ, Δ, Q, Q_i, R), observed weighted tree pairs T ⊆ T_Σ × T_Δ × ℝ⁺, a normalization function Z({count_r | r ∈ R}, r' ∈ R), a minimum relative log-likelihood change for convergence ε ∈ ℝ⁺, a maximum number of iterations maxit ∈ ℕ, and prior counts (for a so-called Dirichlet prior) {prior_r | r ∈ R} for smoothing each rule; it outputs new rule weights W ≡ {w_r | r ∈ R}. For each training pair (i, o, w) ∈ T it builds d_{i,o} ← DERIV(X, i, o) (Algorithm 1), warning that more rules are needed to explain (i, o) and dropping the pair from T if DERIV returns false, and it computes inside/outside weights for d_{i,o}, removing all useless nonterminals n whose β_{d_{i,o}}(n) = 0 or α_{d_{i,o}}(n) = 0. Rule weights are initialized from the transducer (w_r ← w), each iteration restarts the counts from the priors (count_r ← prior_r), and iteration continues until the relative change in log-likelihood falls below ε or maxit iterations are reached.</Paragraph>
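<Paragraph position="4"> A compressed sketch of such a training loop, assuming helper functions deriv (in the role of Algorithm 1) and inside_outside (returning a per-example log-likelihood and expected rule counts); the global renormalization below is a simplified stand-in for the normalization function Z, and all names are illustrative:

import math

def em_train(rules, pairs, deriv, inside_outside, priors, epsilon=1e-4, maxit=100):
    """Illustrative EM loop in the spirit of Algorithm 2 (all helpers are assumed).
    rules: rule ids; pairs: (input, output, weight) training examples;
    deriv(i, o): derivation wRTG for the pair, or None if it cannot be explained;
    inside_outside(g, weights): (log-likelihood, {rule: expected count}) for one example."""
    grammars = []
    for i, o, w in pairs:
        g = deriv(i, o)
        if g is None:
            print("warning: more rules are needed to explain", (i, o))
            continue                          # drop the pair, as the algorithm does
        grammars.append((g, w))

    weights = {r: 1.0 for r in rules}         # or initialize from the transducer's weights
    last_loglik = -math.inf
    for itno in range(maxit):
        counts = dict(priors)                 # count_r starts from its Dirichlet prior
        loglik = 0.0
        for g, w in grammars:                 # E-step: expected rule counts over derivations
            ll, gamma = inside_outside(g, weights)
            loglik += w * ll
            for r, c in gamma.items():
                counts[r] = counts.get(r, 0.0) + w * c
        total = sum(counts.values()) or 1.0   # M-step: renormalize (a per-state Z is
        weights = {r: counts.get(r, 0.0) / total for r in rules}   # the more typical choice)
        rel_change = abs(loglik - last_loglik) / max(abs(loglik), 1.0)
        if epsilon > rel_change:              # converged
            break
        last_loglik = loglik
    return weights
</Paragraph>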
</Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Tree-to-String Transducers (xRS) </SectionTitle> <Paragraph position="0"> We now turn to tree-to-string transducers (xRS). In the automata literature, these were first called generalized syntax-directed translations (Aho and Ullman, 1971) and used to specify compilers. Tree-to-string transducers have also been applied to machine translation (Yamada and Knight, 2001; Eisner, 2003).</Paragraph> <Paragraph position="1"> We give an explicit tree-to-string transducer example in the next section. Formally, a weighted extended-lhs root-to-frontier tree-to-string transducer X is a quintuple (Σ, Δ, Q, Q_i, R) where Σ is the input alphabet, Δ is the output alphabet, Q is a finite set of states, Q_i ∈ Q is the initial (or start, or root) state, and R ⊆ Q × XRPAT_Σ × (Δ ∪ (Q × paths))* × ℝ⁺ is a finite set of weighted transformation rules, written (q, pattern) →_w rhs. A rule says that to transform (with weight w) an input subtree matching pattern while in state q, replace it by the string rhs with its nonterminal (Q × paths) letters replaced by their (recursive) transformations.</Paragraph> <Paragraph position="2"> xRS is the same as xR, except that the rhs are strings containing some nonterminals instead of trees containing nonterminal leaves (so the intermediate derivation objects are strings containing state-marked input subtrees). We have developed an xRS training procedure similar to the xR procedure, with extra computational expense to consider how different productions might map to different spans of the output string. Space limitations prohibit a detailed description; we refer the reader to a longer version of this paper (submitted). We note that this algorithm subsumes normal inside-outside training of PCFGs on strings (Lari and Young, 1990), since we can always fix the input tree to some constant for all training examples.</Paragraph> </Section> <Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> 9 Example </SectionTitle> <Paragraph position="0"> It is possible to cast many current probabilistic natural language models as R-type tree transducers. In this section, we implement the translation model of (Yamada and Knight, 2001). Their generative model provides a formula for P(Japanese string | English tree) in terms of individual parameters, and their appendix gives special EM re-estimation formulae for maximizing the product of these conditional probabilities across the whole tree/string corpus.</Paragraph> <Paragraph position="1"> We now build a trainable xRS tree-to-string transducer that embodies the same P(Japanese string | English tree).</Paragraph> <Paragraph position="2"> First, we need start productions that map the start state q to states like q.TOP.S, which means "translate this tree, whose root is S." Then every q.parent.child pair gets its own set of three insert-function-word productions, e.g.:
q.TOP.S x → i x, r x
q.TOP.S x → r x, i x
q.TOP.S x → r x
q.NP.NN x → i x, r x
q.NP.NN x → r x, i x
q.NP.NN x → r x</Paragraph> <Paragraph position="3"> State i means "produce a Japanese function word out of thin air." We include an i production for every Japanese word in the vocabulary.</Paragraph> <Paragraph position="4"> State r means "re-order my children and then recurse." For internal nodes, we include a production for every parent/child-sequence and every permutation thereof. The rhs sends the child subtrees back to state q for recursive processing; a sketch of how these productions can be enumerated follows below.</Paragraph>
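<Paragraph position="5"> A minimal sketch of enumerating such reorder productions, one per permutation of each parent/child-sequence; the child sequences below and the textual rule rendering are illustrative assumptions:

from itertools import permutations

# Hypothetical parent/child-sequences observed in the English trees.
CHILD_SEQUENCES = [
    ("S",  ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("VB", "NP")),
]

def reorder_productions(sequences):
    """One r-state production per permutation of each parent/child-sequence, rendered
    as a string like 'r S(x0:NP, x1:VP) -> q x1, q x0'."""
    rules = []
    for parent, children in sequences:
        slots = [f"x{k}:{c}" for k, c in enumerate(children)]
        lhs = f"r {parent}({', '.join(slots)})"
        for perm in permutations(range(len(children))):
            rhs = ", ".join(f"q x{k}" for k in perm)
            rules.append(f"{lhs} -> {rhs}")
    return rules

for rule in reorder_productions(CHILD_SEQUENCES):
    print(rule)
# r S(x0:NP, x1:VP) -> q x0, q x1
# r S(x0:NP, x1:VP) -> q x1, q x0
# ... and likewise for NP and VP
</Paragraph>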
<Paragraph position="6"> However, for English leaf nodes, we instead transition to a different state t, so as to prohibit any subsequent Japanese function word insertion. State t means "translate this word," and we have a production for every pair of co-occurring English and Japanese words. This follows (Yamada and Knight, 2001) in also allowing English words to disappear, or translate to epsilon.</Paragraph> <Paragraph position="7"> Every production in the xRS transducer has an associated weight and corresponds to exactly one of the model parameters.</Paragraph> <Paragraph position="8"> There are several benefits to this xRS formulation. First, it clarifies the model, in the same way that (Knight and Al-Onaizan, 1998; Kumar and Byrne, 2003) elucidate other machine translation models in easily grasped FST terms. Second, the model can be trained with generic, off-the-shelf tools--versus the alternative of working out model-specific re-estimation formulae and implementing custom training software. Third, we can easily extend the model in interesting ways. For example, we can add productions for multi-level and lexical re-ordering:
r NP(x0:NP, PP(IN(of), x1:NP)) → q x1, no, q x0
We can add productions for phrasal translations:
r NP(JJ(big), NN(cars)) → ooki, kuruma
This can now include crucial non-constituent phrasal translations:
r S(NP(PRO(there)), VP(VB(are), x0:NP)) → q x0, ga, arimasu
We can also eliminate many epsilon word-translation rules in favor of more syntactically controlled ones, e.g.:
r NP(DT(the), x0:NN) → q x0
We can make many such changes without modifying the training procedure, as long as we stick to tree automata.</Paragraph> </Section> </Paper>