<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1021">
  <Title>A Polynomial-Time Algorithm for Statistical Machine Translation</Title>
  <Section position="5" start_page="153" end_page="154" type="metho">
    <SectionTitle>
3 BTG-Based Search for the
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="153" end_page="154" type="sub_section">
      <SectionTitle>
Original Models
</SectionTitle>
      <Paragraph position="0"> A first approach to improving the translation search is to limit the allowed word alignment patterns to those permitted by a BTG. In this case, Equation (2) is kept as the objective function and the translation channel can be parameterized similarly to Dagan et al. (Dagan, Church, and Gale, 1993). The effect of the BTG restriction is just to constrain the shapes of the word-order distortions. A BTG rather than ITG is used since, as we discussed earlier, pure channel translation models operate without explicit grammars, providing no constituent categories around which a more sophisticated ITG could be structured.</Paragraph>
      <Paragraph position="1"> But the structural constraints of the BTG can improve search efficiency, even without differentiated constituent categories. Just as in the baseline system, we rely on the language and translation models to take up the slack in place of an explicit grammar.</Paragraph>
      <Paragraph position="2"> In this approach, an O(T^7) algorithm similar to the one described later can be constructed to replace A* search.</Paragraph>
      <Paragraph position="3"> However, we do not feel it is worth preserving offset (or alignment or distortion) parameters simply for the sake of preserving the original translation channel model. These parameterizations were only intended to crudely model word-order variation. Instead, the BTG itself can be used directly to probabilistically rank alternative alignments, as described next.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="154" end_page="155" type="metho">
    <SectionTitle>
4 Replacing the Channel Model with a SBTG
</SectionTitle>
    <Paragraph position="0"> The second possibility is to use a stochastic bracketing transduction grammar (SBTG) in the channel model, replacing the translation model altogether. In a SBTG, a probability is associated with each production. Thus for the normal-form BTG, we have:
A → [A A]   with probability a_[]
A → ⟨A A⟩   with probability a_⟨⟩
A → x/y     with probability b(x/y), for all lexical translations x/y
A → x/ε     with probability b(x/ε), for all x in the Chinese vocabulary
A → ε/y     with probability b(ε/y), for all y in the English vocabulary
The translation lexicon is encoded in productions of the third kind. The latter two kinds of productions allow words of either Chinese or English to go unmatched. The SBTG assigns a probability Pr(c, e, q) to all generable trees q and sentence-pairs. In principle it can be used as the translation channel model by normalizing with Pr(e) and integrating out Pr(q) to give Pr(c|e) in Equation (2). In practice, a strong language model makes this unnecessary, so we can instead optimize the simpler Viterbi approximation</Paragraph>
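For concreteness, the handful of parameters above can be held in a single structure; the following is a minimal Python sketch, not the paper's implementation, and the names SBTG, a_straight, a_inverted, b, lex, and EPS are illustrative assumptions (a_straight and a_inverted stand in for a_[] and a_⟨⟩, and the EPS entries for the singleton productions that let a word go unmatched).

    # Minimal sketch of SBTG parameters (illustrative names, not the paper's notation).
    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    EPS = None  # stands in for the empty string (epsilon) in singleton productions

    @dataclass
    class SBTG:
        a_straight: float = 0.6   # probability of the straight production A → [A A]
        a_inverted: float = 0.4   # probability of the inverted production A → ⟨A A⟩
        # b[(x, y)] = b(x/y): x is a (possibly multi-token) Chinese unit or EPS,
        # y is an English word or EPS; EPS entries let a word on either side go unmatched.
        b: Dict[Tuple, float] = field(default_factory=dict)

        def lex(self, x, y) -> float:
            """Lexical production probability b(x/y); 0.0 if the pair is not in the lexicon."""
            return self.b.get((x, y), 0.0)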
    <Paragraph position="2"> To complete the picture, we add a bigram model g_{e_{j-1} e_j} = g(e_j | e_{j-1}) for the English language model Pr(e).</Paragraph>
    <Paragraph position="3"> Offset, alignment, or distortion parameters are entirely eliminated. A large part of the implicit function of such parameters, namely to prevent alignments where too many frame arguments become separated, is rendered unnecessary by the BTG's structural constraints, which prohibit many such configurations altogether. Another part of the parameters' purpose is subsumed by the SBTG's probabilities a_[] and a_⟨⟩, which can be set to prefer straight or inverted orientation depending on the language pair. As in the original models, the language model heavily influences the remaining ordering decisions.</Paragraph>
    <Paragraph position="4"> Matters are complicated by the presence of the bigram model in the objective function (which word-alignment models, as opposed to translation models, do not need to deal with). As in our word-alignment model, the translation algorithm optimizes Equation (4) via dynamic programming, similar to chart parsing (Earley, 1970) but with a probabilistic objective function as for HMMs (Viterbi, 1967). But unlike the word-alignment model, to accommodate the bigram model we introduce indexes in the recurrence not only on subtrees over the source Chinese string, but also on the delimiting words of the target English substrings.</Paragraph>
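To make the role of the delimiting words concrete, a minimal sketch of merging two adjacent hypotheses is given below; the hypothesis structure, the function name combine, and the dict fields are our assumptions, not the paper's code. The bigram model is consulted only at the seam where the two English substrings meet, and inversion simply swaps which substring comes first on the English side.

    def combine(left, right, bigram_logprob, inverted=False):
        """Merge two adjacent partial hypotheses (hypothetical structure: dicts with
        'logprob' plus 'first'/'last' delimiting English words).  The bigram model is
        consulted exactly once, at the seam between the two English substrings;
        inverted=True swaps which piece comes first on the English side."""
        lo, hi = (right, left) if inverted else (left, right)   # English-side order
        seam = bigram_logprob(lo["last"], hi["first"])          # log g(first of hi | last of lo)
        return {
            "logprob": left["logprob"] + right["logprob"] + seam,
            "first": lo["first"],
            "last": hi["last"],
        }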
    <Paragraph position="5"> Another feature of the algorithm is that segmentation of the Chinese input sentence is performed in parallel with the translation search. Conventional architectures for Chinese NLP generally attempt to identify word boundaries as a preprocessing stage. Whenever the segmentation preprocessor prematurely commits to an inappropriate segmentation, difficulties are created for later stages. This problem is particularly acute for translation, since the decision as to whether to regard a sequence as a single unit depends on whether its components can be translated compositionally. This in turn often depends on what the target language is. In other words, the Chinese cannot be appropriately segmented except with respect to the target language of translation, a task-driven definition of correct segmentation. The algorithm is given below. A few remarks about the notation used: c_{s..t} denotes the subsequence of Chinese tokens c_{s+1}, c_{s+2}, ..., c_t. We use E(s..t) to denote the set of English words that are translations of the Chinese word created by taking all tokens in c_{s..t} together. E(s,t) denotes the set of English words that are translations of any of the Chinese words anywhere within c_{s..t}. Note also that we assume the explicit sentence-start and sentence-end tokens c_0 = &lt;s&gt; and c_{T+1} = &lt;/s&gt;, which makes the algorithm description more parsimonious. Finally, the argmax operator is generalized to vector notation to accommodate multiple indices.</Paragraph>
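This notation maps onto a few simple helpers; the sketch below is illustrative only (Lexicon, span, E_compositional, and E_any are our names), and the lexicon keyed by possibly multi-token Chinese spans is an assumption that also reflects how segmentation can be resolved during the search rather than beforehand.

    from typing import Dict, List, Set, Tuple

    # Hypothetical translation lexicon: maps a tuple of Chinese tokens (a candidate
    # word, possibly spanning several tokens) to its candidate English translations.
    Lexicon = Dict[Tuple[str, ...], Set[str]]

    def span(c: List[str], s: int, t: int) -> Tuple[str, ...]:
        """c_{s..t}: the tokens c_{s+1} ... c_t (same convention as the text, 0-based list)."""
        return tuple(c[s + 1 : t + 1])

    def E_compositional(lex: Lexicon, c: List[str], s: int, t: int) -> Set[str]:
        """E(s..t): English words translating the whole of c_{s..t} taken as one unit."""
        return lex.get(span(c, s, t), set())

    def E_any(lex: Lexicon, c: List[str], s: int, t: int) -> Set[str]:
        """E(s,t): English words translating any Chinese word lying anywhere within c_{s..t}."""
        out: Set[str] = set()
        for i in range(s, t):
            for j in range(i + 1, t + 1):
                out |= lex.get(span(c, i, j), set())
        return out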
    <Paragraph position="6"> 1. Initialization
δ_{s t y y} = b(c_{s..t}/y),   for all y ∈ E(s..t), 0 ≤ s &lt; t ≤ T
2. Recursion
For all s, t, y, z such that -1 ≤ s &lt; t ≤ T+1</Paragraph>
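Since the recursion equations themselves did not survive in this extraction, the following is only a rough sketch of the kind of chart-style dynamic program the surrounding text describes, reusing the SBTG, span, and E_compositional sketches above; the function name translate, the relax helper, and the exact loop bounds are our assumptions, not the paper's recurrences.

    import math
    from typing import Dict, Tuple

    def translate(c, lex, grammar, bigram_logprob):
        """Rough chart-search skeleton: c is the Chinese token list with c[0] and c[-1]
        the sentence-start/end markers; entries are keyed by (s, t, y, z), where y and z
        are the delimiting words of the best English substring for the span (s, t)."""
        T = len(c) - 2
        delta: Dict[Tuple[int, int, str, str], float] = {}

        def relax(key, logprob):
            if logprob > delta.get(key, float("-inf")):
                delta[key] = logprob

        # 1. Initialization: a Chinese span c_{s..t} translated as a single English word y
        #    (the sentence markers are assumed to translate to themselves in the lexicon).
        for s in range(-1, T + 1):
            for t in range(s + 1, T + 2):
                for y in E_compositional(lex, c, s, t):
                    p = grammar.lex(span(c, s, t), y)
                    if p > 0.0:
                        relax((s, t, y, y), math.log(p))

        # 2. Recursion: combine adjacent spans (s, S) and (S, t) in straight or inverted
        #    orientation, charging the bigram model once at the English seam.
        for width in range(2, T + 3):
            for s in range(-1, T + 2 - width):
                t = s + width
                for S in range(s + 1, t):
                    lefts = [k for k in delta if k[0] == s and k[1] == S]
                    rights = [k for k in delta if k[0] == S and k[1] == t]
                    for (_, _, y1, z1) in lefts:
                        for (_, _, y2, z2) in rights:
                            lp = delta[(s, S, y1, z1)] + delta[(S, t, y2, z2)]
                            # straight [A A]: the left span's English precedes the right's
                            relax((s, t, y1, z2),
                                  lp + math.log(grammar.a_straight) + bigram_logprob(z1, y2))
                            # inverted ⟨A A⟩: the right span's English precedes the left's
                            relax((s, t, y2, z1),
                                  lp + math.log(grammar.a_inverted) + bigram_logprob(z2, y1))
        return delta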
    <Paragraph position="8"> 3. Reconstruction
Initialize by setting the root of the parse tree to q_0 = (-1, T+1, &lt;s&gt;, &lt;/s&gt;). The remaining descendants in the optimal parse tree are then given recursively for any q = (s, t, y, z) by:</Paragraph>
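The recursive case of the reconstruction is likewise missing from this extraction; as a minimal sketch, assuming a table of backpointers (a back table, our assumption) was recorded during the recursion, the backtrace would look roughly like this:

    def reconstruct(back, q):
        """Recover the parse tree rooted at q = (s, t, y, z) from backpointers.
        Hypothetical convention: back[q] is None for a lexical leaf, or
        (orientation, left_key, right_key) for an internal node."""
        entry = back.get(q)
        if entry is None:
            return ("LEX", q)                     # leaf: span covered by one lexical production
        orientation, left, right = entry          # orientation is "straight" or "inverted"
        return (orientation, reconstruct(back, left), reconstruct(back, right))

    # e.g. tree = reconstruct(back, (-1, T + 1, c[0], c[-1]))  # root spans the whole sentence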
    <Paragraph position="10"> Assume the number of translations per word is bounded by some constant. Then the maximum size of E(s,t) is proportional to t - s. The asymptotic time complexity for the translation algorithm is thus bounded by O(T^7). Note that in practice, actual performance is improved by the sparseness of the translation matrix.</Paragraph>
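As a rough accounting of that bound (our reading of the argument, under the stated assumption that the relevant English word sets have size O(t - s) = O(T)): the recursion considers O(T^2) Chinese spans (s, t) and O(T) split points S, and for each of these it maximizes over the outer delimiting words y, z and the two inner delimiting words at the English seam, each drawn from a set of size O(T), giving O(T^4) word combinations; the product is O(T^2 · T · T^4) = O(T^7).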
    <Paragraph position="11"> An interesting connection has been suggested to direct parsing for ID/LP grammars (Shieber, 1984), in which word-order variations would be accommodated by the parser, and related ideas for generation of free word-order languages in the TAG framework (Joshi, 1987). Our work differs from the ID/LP work in several important respects. First, we are not merely parsing, but translating with a bigram language model. Also, of course, we are dealing with a probabilistic optimization problem. But perhaps most importantly, our goal is to constrain as tightly as possible the space of possible transduction relationships between two languages with fixed word-order, making no other language-specific assumptions; we are thus driven to seek a kind of language-universal property. In contrast, the ID/LP work was directed at parsing a single language with free word-order. As a consequence, it would be necessary to enumerate a specific set of linear-precedence (LP) relations for the language, and moreover the immediate-dominance (ID) productions would typically be more complex than binary-branching. This significantly increases time complexity, compared to our BTG model. Although it is not mentioned in their paper, the time complexity for ID/LP parsing rises exponentially with the length of production right-hand-sides, due to the number of permutations. ITGs avoid this with their restriction to inversions, rather than permutations, and BTGs further minimize the grammar size. We have also confirmed empirically that our models would not be feasible under general permutations.</Paragraph>
  </Section>
</Paper>