<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1627">
  <Title>Efficient Search for Inversion Transduction Grammar</Title>
  <Section position="4" start_page="0" end_page="224" type="metho">
    <SectionTitle>
2 Inversion Transduction Grammar
</SectionTitle>
    <Paragraph position="0"> An Inversion Transduction Grammar can generate pairs of sentences in two languages by recursively applying context-free bilingual production rules.</Paragraph>
    <Paragraph position="1"> Most work on ITG has focused on the 2-normal form, which consists of unary production rules that are responsible for generating word pairs:</Paragraph>
    <Paragraph position="3"> and binary production rules in two forms that are responsible for generating syntactic subtree pairs:</Paragraph>
    <Paragraph position="5"> The rules with square brackets enclosing the right hand side expand the left hand side symbol into the two symbols on the right hand side in the same order in the two languages, whereas the rules with pointed brackets expand the left hand side symbol into the two right hand side symbols in reverse order in the two languages.</Paragraph>
  </Section>
  <Section position="5" start_page="224" end_page="226" type="metho">
    <SectionTitle>
3 A* Viterbi Alignment Selection
</SectionTitle>
    <Paragraph position="0"> A* parsing is a special case of agenda-based chart parsing, where the priority of a node X[i, j] on the agenda, corresponding to nonterminal X spanning positions i through j, is the product of the node's current inside probability with an estimate of the outside probability. By the current inside probability, we mean the probability of the so-far-mostprobable subtree rooted on the node X[i, j], with leaves being iwj, while the outside probability is the highest probability for a parse with the root being S[0, N] and the sequence 0wiXjwn forming the leaves. The node with the highest priority is removed from the agenda and added to the chart, and then explored by combining with all of its neighboring nodes in the chart to update the priorities of the resulting nodes on the agenda. By using estimates close to the actual outside probabilities, A* parsing can effectively reduce the number of nodes to be explored before putting the root node onto the chart. When the outside estimate is both admissible and monotonic, whenever a node is put onto the chart, its current best inside parse is the Viterbi inside parse.</Paragraph>
    <Paragraph position="1"> To relate A* parsing with A* search for nding the lowest cost path from a certain source node to a certain destination node in a graph, we view the forest of all parse trees as a hypergraph.</Paragraph>
    <Paragraph position="2"> The source node in the hypergraph fans out into the nodes of unit spans that cover the individual words. From each group of children to their parent in the forest, there is a hyperedge. The destination node is the common root node for all the parse trees in the forest. Under the mapping, a parse is a hyperpath from the source node to the destination node. The Viterbi parse selection problem thus becomes nding the lowest-cost hyperpath from the source node to the destination node. The cost in this scenario is thus the negative of log probability. The inside estimate and outside estimate naturally correspond to the ^g and ^h for A* searching, respectively.</Paragraph>
    <Paragraph position="3"> A stochastic ITG can be thought of as a stochastic CFG extended to the space of bitext. A node in the ITG chart is a bitext cell that covers a source substring and a target substring. We use the notion of X[l, m, i, j] to represent a tree node in ITG parse. It can potentially be combined with any bitext cells at the four corners, as shown in Figure 1(a).</Paragraph>
    <Paragraph position="4"> Unlike CFG parsing where the leaves are xed, the Viterbi ITG parse selection involves nding the Viterbi alignment under ITG constraint. Good outside estimates have to bound the outside ITG Viterbi alignment probability tightly.</Paragraph>
    <Section position="1" start_page="224" end_page="226" type="sub_section">
      <SectionTitle>
3.1 A* Estimates for Alignment
</SectionTitle>
      <Paragraph position="0"> Under the ITG constraints, each source language word can be aligned with at most one target language word and vice versa. An ITG constituent X[l, m, i, j] implies that the words in the source substring in the span [l, m] are aligned with the words in the target substring [i, j]. It further implies that the words outside the span [l, m] in the source are aligned with the words outside the span [i, j] in the target language. Figure 1(b) displays the tic-tac-toe pattern for the inside and outside components of a particular cell. To estimate the upper bound of the ITG Viterbi alignment probability for the outside component with acceptable complexity, we need to relax the ITG constraint.</Paragraph>
      <Paragraph position="1"> Instead of ensuring one-to-one in both directions, we use a many-to-one constraint in one direction, and we relax all constraints on reordering within the outside component.</Paragraph>
      <Paragraph position="2"> The many-to-one constraint has the same dynamic programming structure as IBM Model 1, where each target word is supposed to be translated from any of the source words or the NULL symbol. In the Model 1 estimate of the outside probability, source and target words can align using any combination of points from the four outside corners of the tic-tac-toe pattern. Thus in Figure 1(b), there is one solid cell (corresponding to the Model 1 Viterbi alignment) in each column, falling either in the upper or lower outside shaded corner. This can be also be thought of as squeezing together the four outside corners, creat-</Paragraph>
      <Paragraph position="4"> adjacent cells in the four outside corners (lighter shading) to expand into larger cells. One possible expansion to the lower left corner is displayed. (b) The tic-tac-toe pattern of alignments consistent with a given cell. If the inner box is used in the nal synchronous parse, all other alignments must come from the four outside corners. (c) Combination of two adjacent cells shown with region for new outside heuristic.</Paragraph>
      <Paragraph position="5"> ing a new cell whose probability is estimated using IBM Model 1. In contrast, the inside Viterbi alignment satis es the ITG constraint, implying only one solid cell in each column and each row. Mathematically, our Model 1 estimate for the outside component is:</Paragraph>
      <Paragraph position="7"> This Model 1 estimate is admissible. Maximizing over each column ensures that the translation probability for each target word is greater than or equal to the corresponding word translation probability under the ITG constraint. Model 1 virtually assigns a probability of 1 for deleting any source word. As a product of word-to-word translation probabilities including deletions and insertions, the ITG Viterbi alignment probability cannot be higher than the product of maximal word-to-word translation probabilities using the Model 1 estimate. null The Model 1 estimate is also monotonic, a prop-erty which is best understood geometrically. A successor state to cell (l, m, i, j) in the search is formed by combining the cell with a cell which is adjacent at one of the four corners, as shown in Figure 1(c). Of the four outside corner regions used in calculating the search heuristic, one will be the same for the successor state, and three will be a subset of the old corner region. Without loss of generality, assume we are combining a cell (m, n, j, k) that is adjacent to (l, m, i, j) to the upper right. We de ne HM1(l, m, i, j) = [?]log hM1(l, m, i, j) as the negative log of the heuristic in order to correspond to an estimated cost or distance in search terminology. Similarly, we speak of the cost of a chart entry c(X[l, m, i, j]) as its negative log probability, and the cost of a cell c(l, m, i, j) as the cost of the best chart entry with the boundaries (l, m, i, j). The cost of the cell (m, n, j, k) which is being combined with the old cell is guaranteed to be greater than the contribution of the columns j through k to the heuristic HM1(l, m, i, j). The contribution of the columns k through N to the new heuristic HM1(l, n, i, k) is guaranteed to be greater in cost than their contribution to the old heuristic. Thus, HM1(l, m, i, j) [?] c(m, n, j, k) + c(X - Y Z) + HM1(l, n, i, k) meaning that the heuristic is monotonic or consistent. null The Model 1 estimate can be applied in both translation directions. The estimates from both directions are an upper bound of the actual ITG Viterbi probability. By taking the minimum of the two, we can get a tighter upper bound.</Paragraph>
      <Paragraph position="8"> We can precompute the Model 1 outside estimate for all bitext cells before parsing starts. A nacurrency1 ve implementation would take O(n6) steps of computation, because there are O(n4) cells, each of which takes O(n2) steps to compute its Model 1 probability. Fortunately, exploiting the recursive  on the top is the Viterbi translation of the sentence on the bottom. Wide range word order change may happen.</Paragraph>
      <Paragraph position="9"> nature of the cells, we can compute values for the inside and outside components of each cell using dynamic programming in O(n4) time (Zhang and Gildea, 2005).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="226" end_page="228" type="metho">
    <SectionTitle>
4 A* Decoding
</SectionTitle>
    <Paragraph position="0"> The of ITG decoding algorithm of Wu (1996) can be viewed as a variant of the Viterbi parsing algorithm for alignment selection. The task of standard alignment is to nd word level links between two xed-order strings. In the decoding situation, while the input side is a xed sequence of words, the output side is a bag of words to be linked with the input words and then reordered. Under the ITG constraint, if the target language substring [i, j] is translated into s1 in the source language and the target substring [j, k] is translated into s2, then s1 and s2 must be consecutive in the source language as well and two possible orderings, s1s2 and s2s1, are allowed. Finding the best translation of the substring of [i, k] involves searching over all possible split points j and two possible reorderings for each split. In theory, the inversion probabilities associated with the ITG rules can do the job of reordering. However, a language model as simple as bigram is generally stronger. Using an n-gram language model implies keeping at least n[?]1 boundary words in the dynamic programming table for a hypothetical translation of a source language substring. In the case of a bigram ITG decoder, a translation hypothesis for the source language sub-string [i, j] is denoted as X[i, j, u, v], where u and v are the left boundary word and right boundary word of the target language counterpart.</Paragraph>
    <Paragraph position="1"> As indicated by the similarity of parsing item notation, the dynamic programming property of the Viterbi decoder is essentially the same as the bitext parsing for nding the underlying Viterbi alignment. By permitting translation from the null target string of [i, i] into source language words as many times as necessary, the decoder can translate an input sentence into a longer output sentence.</Paragraph>
    <Paragraph position="2"> When there is the null symbol in the bag of candidate words, the decoder can choose to translate a word into null to decrease the output length. Both insertions and deletions are special cases of the bi-text parsing items.</Paragraph>
    <Paragraph position="3"> Given the similarity of the dynamic programming framework to the alignment problem, it is not surprising that A* search can also be applied in a similar way. The initial parsing items on the agenda are the basic translation units: X[i, i + 1, u, u], for normal word-for-word translations and deletions (translations into nothing), and also X[i, i, u, u], for insertions (translations from nothing). The goal item is S[0, N,&lt;s&gt; ,&lt;/s&gt; ], where &lt;s&gt; stands for the beginning-of-sentence symbol and &lt;/s&gt; stands for the end-of-sentence symbol. The exploration step of the A* search is to expand the translation hypothesis of a sub-string by combining with neighboring translation hypotheses. When the outside estimate is admissible and monotonic, the exploration is optimal in the sense that whenever a hypothesis is taken from the top of the agenda, it is a Viterbi translation of the corresponding target substring. Thus, when S[0, N,&lt;s&gt; ,&lt;/s&gt; ] is added to the chart, we have found the Viterbi translation for the entire sentence.</Paragraph>
    <Paragraph position="4">  b(X[i, j, u, v]) = max braceleftbigb&lt;&gt; (X[i, j, u, v]), b[](X[i, j, u, v])bracerightbig b[](X[i, j, u, v]) = max k,v1,u2,Y,Z bracketleftBig b(Y [i, k, u, v1]) * b(Z[k, j, u2, v]) * P(X - [Y Z]) * Plm(u2  |v1)  Bottom: An ef cient factorization for straight rules.</Paragraph>
    <Section position="1" start_page="227" end_page="227" type="sub_section">
      <SectionTitle>
4.1 A* Estimates for Translation
</SectionTitle>
      <Paragraph position="0"> The key to the success of A* decoding is an outside estimate that combines word-for-word translation probabilities and n-gram probabilities. Figure 2 is the picture of the outside translations and bigrams of a particular translation hypothesis X[i, j, u, v].</Paragraph>
      <Paragraph position="1"> Our heuristic involves precomputing two values for each word in the input string, involving forward- and backward-looking language model probabilities. For the forward looking value hf at input position n, we take a maximum over the set of words Sn that the input word tn can be translated as:</Paragraph>
      <Paragraph position="3"> is the set of all possible translations for all words in the input string. While hf considers language model probabilities for words following s, the backward-looking value hb considers language model probabilities for s given possible preceding words:</Paragraph>
      <Paragraph position="5"> Our overall heuristic for a partial translation hypothesis X[i, j, u, v] combines language model probabilities at the boundaries of the input sub-string with backward-looking values for the preceding words, and forward-looking values for the following words:</Paragraph>
      <Paragraph position="7"> Because we don't know whether a given input word will appear before or after the partial hypothesis in the nal translation, we take the maximum of the forward and backward values for words outside the span [i, j].</Paragraph>
    </Section>
    <Section position="2" start_page="227" end_page="228" type="sub_section">
      <SectionTitle>
4.2 Combining the Hook Trick with A*
</SectionTitle>
      <Paragraph position="0"> The hook trick is a factorization technique for dynamic programming. For bilexical parsing, Eisner and Satta (1999) pointed out we can reduce the complexity of parsing from O(n5) to O(n4) by combining the non-head constituents with the bilexical rules rst, and then combining the resultant hook constituents with the head constituents. By doing so, the maximal number of interactive variables ranging over n is reduced from 5 to 4.</Paragraph>
      <Paragraph position="1"> For ITG decoding, we can apply a similar factorization trick. We describe the bigram-integrated decoding case here, and refer to Huang et al.</Paragraph>
      <Paragraph position="2"> (2005) for more detailed discussion. Figure 3 shows how to decompose the expression for the case of straight rules; the same method applies to inverted rules. The number of free variables on the right hand side of the second equation is 7: i, j, k, u, v, v1, and u2.1 After factorization, counting the free variables enclosed in the innermost max operator, we get ve: i, k, u, v1, and u2. The decomposition eliminates one free variable, v1. In the outermost level, there are six free variables left. The maximum number of interacting variables is six overall. So, we reduced the complexity of ITG decoding using bigram language model from O(n7) to O(n6). If we visualize an ITG decoding constituent Y extending from source language position i to k and target language boundary words u and v1 with a diagram:</Paragraph>
      <Paragraph position="4"> 1X,Y , andZ range over grammar nonterminals, of which there are a constant number.</Paragraph>
      <Paragraph position="5">  the hook corresponding to the innermost max operator in the equation can be visualized as follows:</Paragraph>
      <Paragraph position="7"> with the expected language model state u2 hanging outside the target language string.</Paragraph>
      <Paragraph position="8"> The trick is generic to the control strategies of actual parsing, because the hooks can be treated as just another type of constituent. Building hooks is like applying special unary rules on top of nonhooks. In terms of of outside heuristic for hooks, there is a slight difference from that for non-hooks: h(i, j, u, v) =  max [hb(n), hf(n)] That is, we do not need the backward-looking estimate for the left boundary word u.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML