<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3601">
  <Title>A Syntax-Directed Translator with Extended Domain of Locality</Title>
  <Section position="4" start_page="3" end_page="4" type="metho">
    <SectionTitle>
3 Extended Tree-to-String Transducers
</SectionTitle>
    <Paragraph position="0"> In this section, we define the formal machinery of our recursive transformation model as a special case of xRs transducers (Graehl and Knight, 2004) that has only one state, and each rule is linear (L) and non-deleting (N) with regarding to variables in the source and target sides (henth the name 1-xRLNs).</Paragraph>
    <Paragraph position="1">  Definition 1. A 1-xRLNs transducer is a tuple (N,S,[?],R) where N is the set of nonterminals, S is the input alphabet, [?] is the output alphabet, and R is a set of rules. A rule in R is a tuple (t,s,ph) where: 1. t is the LHS tree, whose internal nodes are labeled by nonterminal symbols, and whose frontier nodes are labeled terminals from S or variables from a setX ={x1,x2,...}; 2. s[?](X[?][?])[?] is the RHS string; 3. ph is a mapping fromX to nonterminals N.</Paragraph>
    <Paragraph position="2">  We require each variable xi[?]X occurs exactly once in t and exactly once in s (linear and non-deleting). We denote r(t) to be the root symbol of tree t. When writing these rules, we avoid notational overhead by introducing a short-hand form from Galley et al. (2004) that integrates the mapping into the tree, which is used throughout Section 1. Following TSG terminology (see Figure 2), we call these &amp;quot;variable nodes&amp;quot; such as x2:NP-C substitution nodes, since when applying a rule to a tree, these nodes will be matched with a sub-tree with the same root symbol. We also define|X|to be the rank of the rule, i.e., the number of variables in it. For example, rules r1 and r3 in Section 1 are both of rank 2. If a rule has no variable, i.e., it is of rank zero, then it is called a purely lexical rule, which performs a phrasal translation as in phrase-based models. Rule r2, for instance, can be thought of as a phrase pair&lt;the gunman, qiangshou&gt; .</Paragraph>
    <Paragraph position="3"> Informally speaking, a derivation in a transducer is a sequence of steps converting a source-language 2Although hybrid approaches, such as dependency grammars augmented with phrase-structure information (Alshawi et al., 2000), can do re-ordering easily.</Paragraph>
    <Paragraph position="5"> derviation producing the same output by replacing r3 with r6 and r7, which provides another way of translating the passive construction: (r6) VP ( VBD (was) VP-C (x1:VBN x2:PP ) )-x2 x1 (r7) PP ( IN (by) x1:NP-C )-bei x1 tree into a target-language string, with each step applying one tranduction rule. However, it can also be formalized as a tree, following the notion of derivation-tree in TAG (Joshi and Schabes, 1997): Definition 2. A derivation d, its source and target projections, noted E(d) and C(d) respectively, are recursively defined as follows: 1. If r = (t,s,ph) is a purely lexical rule (ph =[?]), then d = r is a derivation, whereE(d) = t and C(d) = s; 2. If r = (t,s,ph) is a rule, and di is a (sub-)  derivation with the root symbol of its source projection matches the corresponding substitution node in r, i.e., r(E(di)) = ph(xi), then d = r(d1,...,dm) is also a derivation, where E(d) = [xi mapsto- E(di)]t and C(d) = [xi mapstoC(di)]s. null Note that we use a short-hand notation [ximapsto-yi]t to denote the result of substituting each xi with yi in t, where xi ranges over all variables in t.</Paragraph>
    <Paragraph position="6"> For example, Figure 4 shows two derivations for the sentence pair in Example (1). In both cases, the source projection is the English tree in Figure 3 (b), and the target projection is the Chinese translation. Galley et al. (2004) presents a linear-time algorithm for automatic extraction of these xRs rules from a parallel corpora with word-alignment and parse-trees on the source-side, which will be used in our experiments in Section 6.</Paragraph>
  </Section>
  <Section position="5" start_page="4" end_page="4" type="metho">
    <SectionTitle>
4 Probability Models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Direct Model
</SectionTitle>
      <Paragraph position="0"> Departing from the conventional noisy-channel approach of Brown et al. (1993), our basic model is a</Paragraph>
      <Paragraph position="2"> where e is the English input string and c[?] is the best Chinese translation according to the translation model Pr(c  |e). We now marginalize over all English parse treesT(e) that yield the sentence e:</Paragraph>
      <Paragraph position="4"> Rather than taking the sum, we pick the best tree t[?] and factors the search into two separate steps: pars-</Paragraph>
      <Paragraph position="6"> In this sense, our approach can be considered as a Viterbi approximation of the computationally expensive joint search using (3) directly. Similarly, we now marginalize over all derivations</Paragraph>
      <Paragraph position="8"> that translates English tree t into some Chinese string and apply the Viterbi approximation again to search for the best derivation d[?]:</Paragraph>
      <Paragraph position="10"> Assuming different rules in a derivation are applied independently, we approximate Pr(d) as</Paragraph>
      <Paragraph position="12"> where the probability Pr(r) of the rule r is estimated by conditioning on the root symbol r(t(r)):</Paragraph>
      <Paragraph position="14"> where c(r) is the count (or frequency) of rule r in the training data.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.2 Log-Linear Model
</SectionTitle>
      <Paragraph position="0"> Following Och and Ney (2002), we extend the direct model into a general log-linear framework in order to incorporate other features:</Paragraph>
      <Paragraph position="2"> where Pr(c) is the language model and e[?]l|c |is the length penalty term based on |c|, the length of the translation. Parameters a, b, and l are the weights of relevant features. Note that positive l prefers longer translations. We use a standard trigram model for Pr(c).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="5" type="metho">
    <SectionTitle>
5 Search Algorithms
</SectionTitle>
    <Paragraph position="0"> We first present a linear-time algorithm for searching the best derivation under the direct model, and then extend it to the log-linear case by a new variant of k-best parsing.</Paragraph>
    <Section position="1" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
5.1 Direct Model: Memoized Recursion
</SectionTitle>
      <Paragraph position="0"> Since our probability model is not based on the noisy channel, we do not call our search module a &amp;quot;decoder&amp;quot; as in most statistical MT work. Instead, readers who speak English but not Chinese can view it as an &amp;quot;encoder&amp;quot; (or encryptor), which corresponds exactly to our direct model.</Paragraph>
      <Paragraph position="1"> Given a fixed parse-tree t[?], we are to search for the best derivation with the highest probability.</Paragraph>
      <Paragraph position="2"> This can be done by a simple top-down traversal (or depth-first search) from the root of t[?]: at each node e in t[?], try each possible rule r whose Englishside pattern t(r) matches the subtree t[?]e rooted at e, and recursively visit each descendant node ei in t[?]e that corresponds to a variable in t(r). We then collect the resulting target-language strings and plug them into the Chinese-side s(r) of rule r, getting a translation for the subtree t[?]e . We finally take the best of all translations.</Paragraph>
      <Paragraph position="3"> With the extended LHS of our transducer, there may be many different rules applicable at one tree node. For example, consider the VP subtree in Fig. 3 (c), where both r3 and r6 can apply. As a result, the number of derivations is exponential in the size of the tree, since there are exponentially many  decompositions of the tree for a given set of rules.</Paragraph>
      <Paragraph position="4"> This problem can be solved by memoization (Cormen et al., 2001): we cache each subtree that has been visited before, so that every tree node is visited at most once. This results in a dynamic programming algorithm that is guaranteed to run in O(npq) time where n is the size of the parse tree, p is the maximum number of rules applicable to one tree node, and q is the maximum size of an applicable rule. For a given rule-set, this algorithm runs in time linear to the length of the input sentence, since p and q are considered grammar constants, and n is proportional to the input length. The full pseudo-code is worked out in Algorithm 1. A restricted version of this algorithm first appears in compiling for optimal code generation from expression-trees (Aho and Johnson, 1976). In computational linguistics, the bottom-up version of this algorithm resembles the tree parsing algorithm for TSG by Eisner (2003). Similar algorithms have also been proposed for dependency-based translation (Lin, 2004; Ding and Palmer, 2005).</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.2 Log-linear Model: k-best Search
</SectionTitle>
      <Paragraph position="0"> Under the log-linear model, one still prefers to search for the globally best derivation d[?]:</Paragraph>
      <Paragraph position="2"> However, integrating the n-gram model with the translation model in the search is computationally very expensive. As a standard alternative, rather than aiming at the exact best derivation, we search for top-k derivations under the direct model using Algorithm 1, and then rerank the k-best list with the language model and length penalty.</Paragraph>
      <Paragraph position="3"> Like other instances of dynamic programming, Algorithm 1 can be viewed as a hypergraph search problem. To this end, we use an efficient algorithm by Huang and Chiang (2005, Algorithm 3) that solves the general k-best derivations problem in monotonic hypergraphs. It consists of a normal forward phase for the 1-best derivation and a recursive backward phase for the 2nd, 3rd, . . . , kth derivations. null Unfortunately, different derivations may have the same yield (a problem called spurious ambiguity), due to multi-level LHS of our rules. In practice, this results in a very small ratio of unique strings among top-k derivations. To alleviate this problem, determinization techniques have been proposed by Mohri and Riley (2002) for finite-state automata and extended to tree automata by May and Knight (2006).</Paragraph>
      <Paragraph position="4"> These methods eliminate spurious ambiguity by effectively transforming the grammar into an equivalent deterministic form. However, this transformation often leads to a blow-up in forest size, which is exponential to the original size in the worst-case.</Paragraph>
      <Paragraph position="5"> So instead of determinization, here we present a simple-yet-effective extension to the Algorithm 3 of Huang and Chiang (2005) that guarantees to output unique translated strings: * keep a hash-table of unique strings at each vertex in the hypergraph * when asking for the next-best derivation of a vertex, keep asking until we get a new string, and then add it into the hash-table This method should work in general for any equivalence relation (say, same derived tree) that can be defined on derivations.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML