<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2122">
  <Title>Inducing Word Alignments with Bilexical Synchronous Trees</Title>
  <Section position="4" start_page="953" end_page="954" type="metho">
    <SectionTitle>
2 Bilexicalization of Inversion
Transduction Grammar
</SectionTitle>
    <Paragraph position="0"> The Inversion Transduction Grammar of Wu (1997) models word alignment between a translation pair of sentences by assuming a binary synchronous tree on top of both sides. Using EM training, ITG can induce good alignments through exploring the hidden synchronous trees from instances of string pairs.</Paragraph>
    <Paragraph position="1"> ITG consists of unary production rules that generate English/foreign word pairs e/f:</Paragraph>
    <Paragraph position="3"> and binary production rules in two forms that generate subtree pairs, written:</Paragraph>
    <Paragraph position="5"> The square brackets indicate the right hand side rewriting order is the same for both languages.</Paragraph>
    <Paragraph position="6"> The pointed brackets indicate there exists a type of syntactic reordering such that the two right hand side constituents rewrite in the opposite order in the second language.</Paragraph>
    <Paragraph position="7"> The unary rules account for the alignment links across two sides. Either e or f may be a special null word, handling insertions and deletions. The two kinds of binary rules (called straight rules and inverted rules) build up a coherent tree structure on top of the alignment links. From a modeling perspective, the synchronous tree that may involve inversions tells a generative story behind the word level alignment.</Paragraph>
    <Paragraph position="8"> An example ITG tree for the sentence pair Je les vois / I see them is shown in Figure 1(left). The probability of the tree is the product rule probabilities at each node:</Paragraph>
    <Paragraph position="10"> The structural constraint of ITG, which is that only binary permutations are allowed on each level, has been demonstrated to be reasonable by Zens and Ney (2003) and Zhang and Gildea (2004). However, in the space of ITG-constrained synchronous trees, we still have choices in making the probabilistic distribution over the trees more realistic. The original Stochastic ITG is the counterpart of Stochastic CFG in the bitext space. The probability of an ITG parse tree is simply a product of the probabilities of the applied rules. Thus, it only captures the fundamental features of word links and re ects how often inversions occur.</Paragraph>
    <Section position="1" start_page="953" end_page="953" type="sub_section">
      <SectionTitle>
2.1 Cross-Language Bilexicalization
</SectionTitle>
      <Paragraph position="0"> Zhang and Gildea (2005) described a model in which the nonterminals are lexicalized by English and foreign language word pairs so that the inversions are dependent on lexical information on the left hand side of synchronous rules. By introducing the mechanism of probabilistic head selection there are four forms of probabilistic binary rules in the model, which are the four possibilities created by taking the cross-product of two orientations (straight and inverted) and two head choices:</Paragraph>
      <Paragraph position="2"> where (e/f) is a translation pair.</Paragraph>
      <Paragraph position="3"> A tree for our example sentence under this model is shown in Figure 1(center). The tree's probability is again the product of rule probabilities: null</Paragraph>
      <Paragraph position="5"/>
    </Section>
    <Section position="2" start_page="953" end_page="954" type="sub_section">
      <SectionTitle>
2.2 Head-Modifier Bilexicalization
</SectionTitle>
      <Paragraph position="0"> One disadvantage of the model above is that it is not capable of modeling bilexical dependencies on the right hand side of the rules. Thus, while the probability of a production being straight or inverted depends on a bilingual word pair, it does not take head-modi er relations in either language into account. However, modeling complete bilingual bilexical dependencies as theorized in Melamed (2003) implies a huge parameter space of O(|V |2|T|2), where |V  |and |T |are the vocabulary sizes of the two languages. So, instead of modeling cross-language word translations and within-language word dependencies in  a joint fashion, we factor them apart. We lexicalize the dependencies in the synchronous tree using words from only one language and translate the words into their counterparts in the other language only at the bottom of the tree. Formally, we have the following patterns of binary dependency rules:</Paragraph>
      <Paragraph position="2"> where e is an English head and eprime is an English modi er.</Paragraph>
      <Paragraph position="3"> Equally importantly, we have the unary lexical rules that generate foreign words:</Paragraph>
      <Paragraph position="5"> To make the generative story complete, we also have a top rule that goes from the unlexicalized start symbol to the highest lexicalized nonterminal in the tree:</Paragraph>
      <Paragraph position="7"> Figure 1(right), shows our example sentence's tree under the new model. The probability of a bilexical synchronous tree between the two sentences is:</Paragraph>
      <Paragraph position="9"> Interestingly, the lexicalized B(see) predicts not only the existence of C(them), but also that there is an inversion involved going from C(see) to C(them). This re ects the fact that direct object pronouns come after the verb in English, but before the verb in French. Thus, despite conditioning on information about words from only one language, the model captures syntactic reordering information about the speci c language pair it is trained on. We are able to discriminate between the straight and inverted binary nodes in our example tree in a way that cross-language bilexicalization could not.</Paragraph>
      <Paragraph position="10"> In terms of inferencing within the framework, we do the usual Viterbi inference to nd the best bilexical synchronous tree and treat the dependencies and the alignment given by the Viterbi parse as the best ones, though mathematically the best alignment should have the highest probability marginalized over all dependencies constrained by the alignment. We do unsupervised training to obtain the parameters using EM. Both EM and Viterbi inference can be done using the dynamic programming framework of synchronous parsing.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="954" end_page="957" type="metho">
    <SectionTitle>
3 Inside-Outside Parsing with the Hook
Trick
</SectionTitle>
    <Paragraph position="0"> ITG parsing algorithm is a CYK-style chart parsing algorithm extended to bitext. Instead of building up constituents over spans on a string, an ITG chart parser builds up constituents over subcells within a cell de ned by two strings. We use b(X(e), s, t, u, v) to denote the inside probability of X(e) which is over the cell of (s, t, u, v) where (s, t) are indices into the source language string and (u, v) are indices into the target language string. We use a(X(e), s, t, u, v) to denote its outside probability. Figure 2 shows how smaller cells adjacent along diagonals can be combined to create a large cell. We number the subcells counterclockwise. To analyze the complexity of the algorithm with respect to input string  four corners for more ef cient parsing.</Paragraph>
    <Paragraph position="1"> length, without loss of generality, we ignore the nonterminal symbols X, Y , and Z to simplify the derivation.</Paragraph>
    <Paragraph position="2"> The inside algorithm in the context of bilexical ITG is based on the following dynamic programming equation:</Paragraph>
    <Paragraph position="4"> So, on the right hand side, we sum up all possible ways (S, U) of splitting the left hand side cell and all possible head words (eprime) for the non-head subcell. e, eprime, s, t, u, v, S, and U all eight variables take O(n) values given that the lengths of the source string and the target string are O(n).</Paragraph>
    <Paragraph position="5"> Thus the entire DP algorithm takes O(n8) steps.</Paragraph>
    <Paragraph position="6"> Fortunately, we can reduce the maximum number of interacting variables by factorizing the expression. null Let us keep the results of the summations over eprime as:</Paragraph>
    <Paragraph position="8"> The computation of each b+ involves four boundary indices and two head words. So, we can rely on DP to compute them in O(n6). Based on these intermediate results, we have the equivalent DP expression for computing inside probabilities:</Paragraph>
    <Paragraph position="10"> We reduced one variable from the original expression. The maximum number of interacting variables throughout the algorithm is 7. So the improved inside algorithm has a time complexity of O(n7).</Paragraph>
    <Paragraph position="11"> The trick of reducing interacting variables in DP for bilexical parsing has been pointed out by Eisner and Satta (1999). Melamed (2003) discussed the applicability of the so-called hook trick for parsing bilexical multitext grammars. The name hook is based on the observation that we combine the non-head constituent with the bilexical rule to create a special constituent that matches the head like a hook as demonstrated in Figure 2. However, for EM, it is not clear from their discussions how we can do the hook trick in the outside pass.</Paragraph>
    <Paragraph position="12"> The bilexical rules in all four directions are analogous. To simplify the derivation for the outside algorithm, we just focus on the rst case: straight rule with right head word.</Paragraph>
    <Paragraph position="13"> The outside probability of the constituent</Paragraph>
    <Paragraph position="15"> which indicates we can reuse b+ of the lower left neighbors of the head to make the computation feasible in O(n7).</Paragraph>
    <Paragraph position="16"> On the other hand, the outside probability for (eprime, s, S, u, U) in cell 3 acting as a modi er of such  a rule is:</Paragraph>
    <Paragraph position="18"> in which we memorize another kind of intermediate sum to make the computation no more complex than O(n7).</Paragraph>
    <Paragraph position="19"> We can think of a+3 as the outside probability of the hook on cell 3 which matches cell 1. Generally, we need outside probabilities for hooks in all four directions.</Paragraph>
    <Paragraph position="21"> Based on them, we can add up the outside probabilities of a constituent acting as one of the two children of each applicable rule on top of it to get the total outside probability.</Paragraph>
    <Paragraph position="22"> We nalize the derivation by simplifying the expression of the expected count of (e - [eprimee]).</Paragraph>
    <Paragraph position="24"> which can be computed in O(n6) as long as we have a+3 ready in a table. Overall we can do the inside-outside algorithm for the bilexical ITG in O(n7), by reducing a factor of n through intermediate DP.</Paragraph>
    <Paragraph position="25"> The entire trick can be understood very clearly if we imagine the bilexical rules are unary rules that are applied on top of the non-head constituents to reduce it to a virtual lexical constituent (a hook) covering the same subcell while sharing the head word with the head constituent. However, if we build hooks looking for all words in a sentence whenever a complete constituent is added to the chart, we will build many hooks that are never used, considering that the words outside of larger cells are fewer and pruning might further reduce the possible outside words. Blind guessing of what might appear outside of the current cell will offset the saving we can achieve. Instead of actively building hooks, which are intermediate results, we can build them only when we need them and then cache them for future use. So the construction of the hooks will be invoked by the heads when the heads need to combine with adjacent cells.</Paragraph>
    <Section position="1" start_page="956" end_page="957" type="sub_section">
      <SectionTitle>
3.1 Pruning and Smoothing
</SectionTitle>
      <Paragraph position="0"> We apply one of the pruning techniques used in Zhang and Gildea (2005). The technique is general enough to be applicable to any parsing algorithm over bitext cells. It is called tic-tac-toe pruning since it involves an estimate of both the inside probability of the cell (how likely the words within the box in both dimensions are to align) and the outside probability (how likely the words outside the box in both dimensions are to align). By scoring the bitext cells and throwing away the bad cells that fall out of a beam, it can reduce over 70% of O(n4) cells using 10[?]5 as the beam ratio for sentences up to 25 words in the experiments, without harming alignment error rate, at least for the unlexicalized ITG.</Paragraph>
      <Paragraph position="1"> The hook trick reduces the complexity of bilexical ITG from O(n8) to O(n7). With the tic-tac-toe pruning reducing the number of bitext cells to work with, also due to the reason that the grammar constant is very small for ITG. the parsing algorithm runs with an acceptable speed, The probabilistic model has lots of parameters of word pairs. Namely, there are O(|V |2) dependency probabilities and O(|V ||T|) translation probabilities, where |V  |is the size of English vocabulary and |T |is the size of the foreign language vocabulary. The translation probabilities of P(f|X(e)) are backed off to a uniform distribution. We let the bilexical dependency probabilities back off to uni-lexical dependencies in the following forms:</Paragraph>
      <Paragraph position="3"> bitext cells before parsing begins.</Paragraph>
      <Paragraph position="4"> The two levels of distributions are interpolated using a technique inspired by Witten-Bell smoothing (Chen and Goodman, 1996). We use the expected count of the left hand side lexical nonterminal to adjust the weight for the EM-trained bilexical probability. For example,</Paragraph>
      <Paragraph position="6"/>
    </Section>
  </Section>
class="xml-element"></Paper>