<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1606">
  <Title>SPMT: Statistical Machine Translation with Syntactified Target Language Phrases</Title>
  <Section position="5" start_page="44" end_page="45" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> THESE 7 PEOPLE INCLUDE COMINGFROM FRANCE AND RUSSIA p-DE ASTRO- -NAUTS . these 7 people include astronauts coming from france and russia . (3) One method for increasing the ability of a decoder to reorder target language phrases is that of decorating them with syntactic constituent information. For example, we may make explicit that the Chinese phrase &amp;quot;ASTRO- -NAUTS&amp;quot; may be translated into English as a noun phrase, NP(NNS(astronauts)); that the phrase FRANCE AND RUSSIA may be translated into a complex noun phrase, NP(NP(NNP(france)) CC(and) NP(NNP(russia))); that the phrase COMINGFROM may be translated into a partially realized verb phrase that is looking for a noun phrase to its right in order to be fully realized, VP(VBG(coming) PP(IN(from) NP:x0)); and that the Chinese particle p-DE, when occurring between a Chinese string that was translated into a verb phrase to its left and another Chinese string that was translated into a noun phrase to its right, VP:x1 p-DE NP:x0, should be translated to nothing, while forcing the reordering of the two constituents, NP(NP:x0, VP:x1). If all these translation rules (labeled r1 to r4 in Figure 1) were available to a decoder that derives English parse trees starting from Chinese input strings, this decoder could produce derivations such as that shown in Figure 2. Because our approach uses translation rules with Syntactified target language Phrases (see Figure 1), we call it SPMT.</Paragraph>
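To make the intuition above concrete, the sketch below shows one way the four example rules could be encoded and composed to build the English NP of Figure 2 while generating the Chinese string of example (3). It is an illustrative toy in Python, not the paper's implementation, and it drops the NP:/VP: type constraints on variables for brevity.

    # Minimal sketch: tree fragments are nested tuples; strings "x0", "x1", ... are variables.
    from dataclasses import dataclass

    @dataclass
    class Rule:
        name: str
        target: object        # English tree fragment with variables
        source: list          # foreign side: tokens and variables

    def subst_tree(frag, bindings):
        """Replace variables in a target tree fragment with bound subtrees."""
        if isinstance(frag, str):
            return bindings[frag] if frag in bindings else frag
        return tuple(subst_tree(child, bindings) for child in frag)

    def subst_source(tokens, bindings):
        """Replace variables in the foreign side with bound foreign token lists."""
        out = []
        for tok in tokens:
            out.extend(bindings[tok] if tok in bindings else [tok])
        return out

    def apply_rule(rule, *children):
        """children are (tree, foreign_tokens) pairs filling x0, x1, ... in order."""
        bindings = {f"x{i}": c for i, c in enumerate(children)}
        tree = subst_tree(rule.target, {k: v[0] for k, v in bindings.items()})
        src = subst_source(rule.source, {k: v[1] for k, v in bindings.items()})
        return tree, src

    r1 = Rule("r1", ("NP", ("NNS", "astronauts")), ["ASTRO-", "-NAUTS"])
    r2 = Rule("r2", ("NP", ("NP", ("NNP", "france")), ("CC", "and"),
                     ("NP", ("NNP", "russia"))), ["FRANCE", "AND", "RUSSIA"])
    r3 = Rule("r3", ("VP", ("VBG", "coming"), ("PP", ("IN", "from"), "x0")),
              ["COMINGFROM", "x0"])
    r4 = Rule("r4", ("NP", "x0", "x1"), ["x1", "p-DE", "x0"])   # target NP(NP:x0, VP:x1)

    # One derivation that builds the NP in Figure 2: r4(r1, r3(r2))
    tree, foreign = apply_rule(r4, apply_rule(r1), apply_rule(r3, apply_rule(r2)))
    print(foreign)  # ['COMINGFROM', 'FRANCE', 'AND', 'RUSSIA', 'p-DE', 'ASTRO-', '-NAUTS']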
    <Section position="1" start_page="44" end_page="45" type="sub_section">
      <SectionTitle>
2.2 A formal introduction to SPMT
</SectionTitle>
      <Paragraph position="0"> We are interested in modeling a generative process that explains how English parse trees pi, their associated English string yields E, foreign sentences F, and word-level alignments A are produced. We assume that observed (pi,F,A) triplets are generated by a stochastic process similar to that of (Bonnema, 2002).</Paragraph>
      <Paragraph position="2"> For example, if we assume that the generative process has already produced the top NP node in Figure 2, then the corresponding partial English parse tree, foreign/source string, and word-level alignment could be generated by the rule derivation r4(r1,r3(r2)), where each rule is assumed to have some probability.</Paragraph>
      <Paragraph position="3"> The extended tree-to-string transducers introduced by Knight and Graehl (2005) provide a natural framework for expressing the tree-to-string transformations specific to our SPMT models.</Paragraph>
      <Paragraph position="4"> The transformation rules we plan to exploit are equivalent to one-state xRS top-down transducers with look-ahead, which map subtree patterns to strings. For example, rule r3 in Figure 1 can be applied only when one is in a state that has a VP as its syntactic constituent and the tree pattern VP(VBG(coming) PP(IN(from) NP)) immediately underneath. The rule application outputs the string &amp;quot;COMINGFROM&amp;quot; as the transducer moves to the state co-indexed by x0; the outputs produced from the new state are concatenated to the right of the string &amp;quot;COMINGFROM&amp;quot;.</Paragraph>
      <Paragraph position="5"> Since there are multiple derivations that could lead to the same outcome, the probability of a tuple (pi,F,A) is obtained by summing over all derivations θi ∈ Θ that are consistent with the tuple, i.e., c(θi) = (pi,F,A).</Paragraph>
      [Figure 2 caption fragment: ... Chinese string COMINGFROM FRANCE AND RUSSIA p-DE ASTRO- -NAUTS.]
    </Section>
  </Section>
  <Section position="6" start_page="45" end_page="48" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> The probability of each derivation θi is given by the product of the probabilities of all the rules p(rj) in the derivation (see equation 4).</Paragraph>
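A hedged reconstruction of equation (4), which is only referenced in the text above; it restates the two preceding sentences in LaTeX, using the symbols pi, F, A, θi, Θ, c, and p(rj) already introduced:

    \begin{equation}
      \Pr(\pi, F, A) \;=\;
      \sum_{\theta_i \in \Theta \,:\, c(\theta_i) = (\pi, F, A)}
      \;\prod_{r_j \in \theta_i} p(r_j)
      \tag{4}
    \end{equation}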
    <Paragraph position="2"> In order to acquire the rules specific to our model and to induce their probabilities, we parse the English side of our corpus with an in-house implementation (Soricut, 2005) of Collins parsing models (Collins, 2003) and we word-align the parallel corpus with the Giza++ implementation of the IBM models (Brown et al., 1993). We use the automatically derived ⟨English-parse-tree, English-sentence, Foreign-sentence, Word-level-alignment⟩ tuples in order to induce xRS rules for several models.</Paragraph>
    <Paragraph position="3"> In our simplest model, we assume that each tuple (pi,F,A) in our automatically annotated corpus could be produced by applying a combination of minimally syntactified, lexicalized, phrase-based-compatible xRS rules, and minimal/necessary, non-lexicalized xRS rules. We call a rule non-lexicalized whenever it does not have any directly aligned source-to-target words. Rules r9-r12 in Figure 1 are examples of non-lexicalized rules.</Paragraph>
    <Paragraph position="4"> Minimally syntactified, lexicalized, phrase-based-compatible xRS rules are extracted via a simple algorithm that finds for each foreign phrase F_i^j the smallest xRS rule that is consistent with the foreign phrase F_i^j, the English syntactic tree pi, and the alignment A. The algorithm finds for each foreign/source phrase span its projected span on the English side and then traverses the English parse tree bottom up until it finds a node that subsumes the projected span. If this node has children that fall outside the projected span, then those children give rise to rules that have variables. For example, if the tuple shown in Figure 2 is in our training corpus, for the foreign/source phrases FRANCE, FRANCE AND, FRANCE AND RUSSIA, and ASTRO- -NAUTS, we extract the minimally syntactified, lexicalized, phrase-based-compatible xRS rules r5, r6, r2, and r7 in Figure 1, respectively. Because, as in phrase-based MT, all our rules have continuous phrases on both the source and target language sides, we call these phrase-based-compatible xRS rules.</Paragraph>
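The following Python fragment is a rough sketch, under simplifying assumptions, of the span-projection step just described; it is not the authors' code, and it omits the alignment-consistency checks a full extractor would need. It finds the projected English span of a foreign phrase and the lowest parse-tree node subsuming that span; the children of that node lying outside the projection are the ones that would surface as variables (e.g., NP:x0) in the extracted rule.

    class TreeNode:
        """English parse-tree node with a word span [start, end), end exclusive."""
        def __init__(self, label, start, end, children=()):
            self.label = label
            self.start, self.end = start, end
            self.children = list(children)

    def projected_span(f_start, f_end, alignment):
        """English span covered by foreign positions [f_start, f_end).
        `alignment` is a set of (foreign_index, english_index) pairs."""
        points = [e for (f, e) in alignment if f_start <= f < f_end]
        if not points:
            return None
        return (min(points), max(points) + 1)

    def lowest_subsuming_node(root, span):
        """Lowest tree node whose span covers `span` (equivalent to climbing
        bottom-up from the projected span until a subsuming node is found)."""
        node = root
        while True:
            tighter = [c for c in node.children
                       if c.start <= span[0] and span[1] <= c.end]
            if not tighter:
                return node
            node = tighter[0]

    def rule_skeleton(node, span):
        """Children of `node` that fall entirely outside the projected span
        become variables of the extracted minimally syntactified rule."""
        lexical, variables = [], []
        for child in node.children:
            if child.end <= span[0] or child.start >= span[1]:
                variables.append(child.label)    # e.g. surfaces as NP:x0
            else:
                lexical.append(child.label)
        return node.label, lexical, variables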
    <Paragraph position="5"> Since these lexicalized rules are not sufficient to explain an entire (pi,F,A) tuple, we also extract the required minimal/necessary, non-lexicalized xRS rules. The minimal non-lexicalized rules that are licensed by the tuple in Figure 2 are labeled r4, r9, r10, r11, and r12 in Figure 1. To obtain the non-lexicalized xRS rules, we compute the set of all minimal rules (lexicalized and non-lexicalized) by applying the algorithm proposed by Galley et al. (2006) and then remove the lexicalized rules.</Paragraph>
    <Paragraph position="6"> We remove Galley et al.'s lexicalized rules because they are either already accounted for by the minimally syntactified, lexicalized, phrase-based-compatible xRS rules or they subsume non-continuous source-target phrase pairs.</Paragraph>
    <Paragraph position="7"> It is worth mentioning that, in our framework, a rule is defined to be &amp;quot;minimal&amp;quot; with respect to a foreign/source language phrase, i.e., it is the minimal xRS rule that yields that source phrase. In contrast, in the work of Galley et al. (2004; 2006), a rule is defined to be minimal when it is necessary in order to explain a (pi,F,A) tuple.</Paragraph>
    <Paragraph position="8"> Under SPMT model 1, the tree in Figure 2 can be produced, for example, by the following derivation: r4(r9(r7),r3(r6(r12(r8)))).</Paragraph>
    <Paragraph position="9"> We hypothesize that composed rules, i.e., rules that can be decomposed via the application of a sequence of Model 1 rules, may improve the performance of an SPMT system. For example, although the minimal Model 1 rules r11 and r13 are sufficient for building an English NP on top of two NPs separated by the Chinese conjunction AND, the composed rule r14 in Figure 1 accomplishes the same result in only one step. We hope that the composed rules could play in SPMT the same role that phrases play in string-based translation models. To test our hypothesis, we modify our rule extraction algorithm so that for every foreign phrase F_i^j we extract not only a minimally syntactified, lexicalized xRS rule, but also one composed rule. The composed rule is obtained by extracting the rule licensed by the foreign/source phrase, alignment, English parse tree, and the first multi-child ancestor node of the root of the minimal rule. Our intuition is that composed rules that involve the application of more than two minimal rules are not reliable. For example, for the tuple in Figure 2, the composed rules that we extract given the foreign phrases AND and COMINGFROM are labeled r14 and r15 in Figure 1, respectively. Under the SPMT composed model 1, the tree in Figure 2 can be produced, for example, by the following derivation: r15(r9(r7),r14(r12(r5),r12(r8))).</Paragraph>
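Continuing the extraction sketch given earlier (and reusing its hypothetical TreeNode class), the composed-rule choice described here would amount to climbing from the root of the minimal rule to its first ancestor with more than one child; this is an assumption-laden illustration, not the paper's code.

    def first_multichild_ancestor(minimal_root_path):
        """`minimal_root_path`: the minimal rule's root node first, followed by its
        ancestors in order up to the tree root. Returns the node at which the single
        extra composed rule per foreign phrase would be extracted."""
        for ancestor in minimal_root_path[1:]:
            if len(ancestor.children) > 1:
                return ancestor
        return minimal_root_path[-1]   # fall back to the tree root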
    <Paragraph position="10"> In many instances, the tuples (pi,F,A) in our training corpus exhibit alignment patterns that can be easily handled within a phrase-based SMT framework, but that become problematic for the SPMT models discussed so far.</Paragraph>
    <Paragraph position="11"> Consider, for example, the (pi,F,A) tuple fragment in Figure 3. When using a phrase-based translation model, one can easily extract the phrase pair (THE MUTUAL; the mutual) and use it during the phrase-based model estimation phase and in decoding. However, within the xRS transducer framework that we use, it is impossible to extract an equivalent syntactified phrase translation rule that subsumes the same phrase pair because valid xRS translation rules cannot be multi-headed. When faced with this constraint, one has several options. One can label such phrase pairs as non-syntactifiable and ignore them. Unfortunately, this is a lossy choice: on our parallel English-Chinese corpus, we have found that approximately 28% of the foreign/source phrases are non-syntactifiable by this definition. One can also traverse the parse tree upwards until one reaches a node that is xRS valid, i.e., a node that subsumes the entire English span induced by a foreign/source phrase and the corresponding word-level alignment. This choice is also inappropriate because phrase pairs that are usually available to phrase-based translation systems are then expanded and made available in the SPMT models only in larger applicability contexts.</Paragraph>
    <Paragraph position="12"> A third option is to create xRS-compatible translation rules that overcome this constraint. Our SPMT Model 2 adopts the third option by rewriting on the fly the English parse tree for each foreign/source phrase and alignment that lead to non-syntactifiable phrase pairs. The rewriting process adds new rules to those that can be created under the SPMT model 1 constraints. The process creates one xRS rule that is headed by a pseudo, non-syntactic nonterminal symbol that subsumes the target phrase and the corresponding multi-headed syntactic structure; and one sibling xRS rule that explains how the non-syntactic nonterminal symbol can be combined with other genuine nonterminals in order to obtain genuine parse trees. In this view, the foreign/source phrase THE MUTUAL and the corresponding alignment in Figure 3 license the rules *NPB*_NN(DT(the) JJ(mutual)) - THE MUTUAL and NPB(*NPB*_NN:x0 NN:x1) - x0 x1, even though the foreign word UNDERSTANDING is aligned to an English word outside the NPB constituent. The name of the non-syntactic nonterminal reflects the intuition that the English phrase &amp;quot;the mutual&amp;quot; corresponds to a partially realized NPB that needs an NN to its right in order to be fully realized. Our hope is that the rules headed by pseudo nonterminals could make available to an SPMT system all the rules that are typically available to a phrase-based system; and that the sibling rules could provide a sufficiently robust generalization layer for integrating pseudo, partially realized constituents into the overall decoding process.</Paragraph>
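The small sketch below illustrates, under assumptions of our own (it is not the authors' tooling), how the two rules licensed by the example in Figure 3 could be generated: given the covered children of the NPB node and the labels of the uncovered ones, it mints the pseudo nonterminal and emits both the lexicalized rule and its sibling gluing rule.

    def pseudo_rules(parent_label, covered, uncovered_labels, foreign_phrase):
        """covered: (label, word) pairs inside the projected span;
        uncovered_labels: labels of the children outside it."""
        pseudo = f"*{parent_label}*_" + "_".join(uncovered_labels)   # e.g. *NPB*_NN
        lexical_rule = (pseudo,
                        [f"{lab}({word})" for lab, word in covered],
                        foreign_phrase)
        sibling_lhs = [f"{pseudo}:x0"] + [f"{lab}:x{i + 1}"
                                          for i, lab in enumerate(uncovered_labels)]
        sibling_rule = (parent_label, sibling_lhs,
                        [f"x{i}" for i in range(len(sibling_lhs))])
        return lexical_rule, sibling_rule

    # Example from Figure 3: "the mutual" under an NPB whose NN child lies outside the phrase span
    lex, sib = pseudo_rules("NPB",
                            covered=[("DT", "the"), ("JJ", "mutual")],
                            uncovered_labels=["NN"],
                            foreign_phrase=["THE", "MUTUAL"])
    print(lex)   # ('*NPB*_NN', ['DT(the)', 'JJ(mutual)'], ['THE', 'MUTUAL'])
    print(sib)   # ('NPB', ['*NPB*_NN:x0', 'NN:x1'], ['x0', 'x1'])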
    <Paragraph position="13">  The SPMT composed model 2 uses all rule types described in the previous models.</Paragraph>
    <Section position="1" start_page="47" end_page="48" type="sub_section">
      <SectionTitle>
2.3 Estimating rule probabilities
</SectionTitle>
      <Paragraph position="0"> For each model, we extract all rule instances that are licensed by a symmetrized Giza-aligned parallel corpus and the constraints we put on the model.</Paragraph>
      <Paragraph position="1"> We condition on the root node of each rule and use the rule counts f(r) and a basic maximum likelihood estimator to assign to each rule type a conditional probability (see equation 5).</Paragraph>
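Equation (5) is only referenced above; the LaTeX below is a hedged reconstruction of the root-conditioned relative-frequency estimate that the sentence describes, with f(r) the count of rule r:

    \begin{equation}
      p\bigl(r \mid \mathrm{root}(r)\bigr) \;=\;
      \frac{f(r)}{\sum_{r' \,:\, \mathrm{root}(r') = \mathrm{root}(r)} f(r')}
      \tag{5}
    \end{equation}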
      <Paragraph position="3"> It is unlikely that this joint probability model can be discriminative enough to distinguish between good and bad translations. We are not too concerned though because, in practice, we decode using a larger set of submodels (feature functions).</Paragraph>
      <Paragraph position="4"> Given the way all our lexicalized xRS rules have been created, one can safely strip out the syntactic information and end up with phrase-to-phrase translation rules. For example, in the string-to-string world, rule r5 in Figure 1 can be rewritten as &amp;quot;france - FRANCE&amp;quot;; and rule r6 can be rewritten as &amp;quot;france and - FRANCE AND&amp;quot;. When one analyzes the lexicalized xRS rules in this manner, it is easy to associate with them any of the submodel probability distributions that have been proven useful in statistical phrase-based MT. The non-lexicalized rules are assigned probability distributions under these submodels as well by simply assuming a NULL phrase for any missing lexicalized source or target phrase.</Paragraph>
      <Paragraph position="5"> In the experiments described in this paper, we use the following submodels (feature functions): Syntax-based-like submodels: proot(ri) is the root-normalized conditional probability of all the rules in a model.</Paragraph>
      <Paragraph position="6"> pcfg(ri) is the CFG-like probability of the non-lexicalized rules in the model. The lexicalized rules have by definition pcfg = 1.</Paragraph>
      <Paragraph position="7"> is_lexicalized(ri) is an indicator feature function that has value 1 for lexicalized rules, and value 0 otherwise.</Paragraph>
      <Paragraph position="8"> is_composed(ri) is an indicator feature function that has value 1 for composed rules.</Paragraph>
      <Paragraph position="9"> is_lowcount(ri) is an indicator feature function that has value 1 for the rules that occur fewer than 3 times in the training corpus.</Paragraph>
      <Paragraph position="10"> Phrase-based-like submodels: lex_pef(ri) is the direct phrase-based conditional probability computed over the foreign/source and target phrases subsumed by a rule.</Paragraph>
      <Paragraph position="11"> lex_pfe(ri) is the inverse phrase-based conditional probability computed over the source and target phrases subsumed by a rule.</Paragraph>
      <Paragraph position="12"> m1(ri) is the IBM model 1 probability computed over the bags of words that occur on the source and target sides of a rule.</Paragraph>
      <Paragraph position="13"> m1inv(ri) is the IBM model 1 inverse probability computed over the bags of words that occur on the source and target sides of a rule.</Paragraph>
      <Paragraph position="14"> lm(e) is the language model probability of the target translation under an ngram language model.</Paragraph>
      <Paragraph position="15"> wp(e) is a word penalty model designed to favor longer translations.</Paragraph>
      <Paragraph position="16"> All these models are combined log-linearly during decoding. The weights of the models are computed automatically using a variant of the Maximum Bleu training procedure proposed by Och (2003).</Paragraph>
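As an illustration of the log-linear combination just described, the toy Python snippet below scores a single derivation by summing weighted feature values; the feature names mirror the submodels listed above, but the numeric values and unit weights are invented for the example (in the paper the weights come from Maximum Bleu training).

    import math

    def loglinear_score(features, weights):
        """Weighted sum of submodel feature values; higher is better."""
        return sum(weights[name] * value for name, value in features.items())

    features = {
        "log_p_root":     math.log(0.02),   # proot over the rules in the derivation
        "log_p_cfg":      math.log(0.10),   # pcfg over the non-lexicalized rules
        "is_lexicalized": 3,                # count of lexicalized rules used
        "log_lex_pef":    math.log(0.05),   # direct phrase probabilities
        "log_lex_pfe":    math.log(0.04),   # inverse phrase probabilities
        "log_lm":         math.log(1e-9),   # ngram language model probability
        "word_penalty":   7,                # number of target words produced
    }
    weights = {name: 1.0 for name in features}  # tuned with Maximum Bleu training in practice
    print(loglinear_score(features, weights))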
      <Paragraph position="17"> The phrase-based-like submodels have proved useful in phrase-based approaches to SMT (Och and Ney, 2004). The first two syntax-based submodels implement a &amp;quot;fused&amp;quot; translation and lexically grounded distortion model (proot) and a syntax-based distortion model (pcfg). The indicator submodels are used to determine the extent to which our system prefers lexicalized vs. non-lexicalized rules; simple vs. composed rules; and high- vs. low-count rules.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="48" end_page="48" type="metho">
    <SectionTitle>
3 Decoding
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
3.1 Decoding with one SPMT model
</SectionTitle>
      <Paragraph position="0"> We decode with each of our SPMT models using a straightforward, bottom-up, CKY-style decoder that builds English syntactic constituents on top of Chinese sentences. The decoder uses a binarized representation of the rules, which is obtained via a synchronous binarization procedure (Zhang et al., 2006). The CKY-style decoder computes the probability of English syntactic constituents in a bottom-up fashion, by log-linearly interpolating all the submodel scores described in Section 2.3.</Paragraph>
      <Paragraph position="1"> The decoder is capable of producing nbest derivations and nbest lists (Knight and Graehl, 2005), which are used for Maximum Bleu training (Och, 2003). When decoding the test corpus, the decoder returns the translation that has the most probable derivation; in other words, the sum operator in equation 4 is replaced with an argmax.</Paragraph>
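A toy illustration of the approximation mentioned in the last sentence: rather than summing rule-probability products over all derivations of the same translation (equation 4), the decoder scores derivations individually and returns the translation of the single best one. The strings and scores below are made up.

    derivations = [
        # (translation, log probability of this particular derivation)
        ("astronauts coming from france and russia", -12.1),
        ("astronauts coming from france and russia", -13.4),
        ("astronauts that come from france and russia", -12.7),
    ]
    best_translation, best_score = max(derivations, key=lambda d: d[1])
    print(best_translation)   # translation of the argmax derivation, not of the summed score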
    </Section>
    <Section position="2" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
3.2 Decoding with multiple SPMT models
</SectionTitle>
      <Paragraph position="0"> Combining multiple MT outputs to increase performance is, in general, a difficult task (Matusov et al., 2006) when significantly different engines compete for producing the best outputs. In our case, combining multiple MT outputs is much simpler because the submodel probabilities across the four models described here are mostly identical, with the exception of the root-normalized and CFG-like submodels, which are scaled differently: since Model 2 composed has, for example, more rules than Model 1, the root-normalized and CFG-like submodels assign smaller probabilities to identical rules in Model 2 composed than in Model 1. We compare these two probabilities across the submodels and scale all model probabilities to be compatible with those of Model 2 composed.</Paragraph>
      <Paragraph position="1"> With this scaling procedure in place, we produce 6,000 non-unique nbest lists for all sentences in our development corpus, using all SPMT submodels. We concatenate the lists and learn a new combination of weights that maximizes the Bleu score of the combined nbest list, using the same development corpus we used for tuning the individual systems (Och, 2003). We use the new weights in order to rerank the nbest outputs on the test corpus.</Paragraph>
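A schematic sketch (not the authors' tooling) of the reranking step just described: pool the nbest entries produced by the individual SPMT models and, given a new weight vector tuned on the pooled development nbest list, return the highest-scoring hypothesis per sentence. The Maximum Bleu weight tuning itself is not shown.

    def rerank(pooled_nbest, weights):
        """pooled_nbest: (sentence_id, hypothesis, feature_dict) triples pooled
        from the nbest lists of all SPMT models; weights: tuned feature weights."""
        best = {}
        for sid, hyp, feats in pooled_nbest:
            score = sum(weights[name] * value for name, value in feats.items())
            if sid not in best or score > best[sid][0]:
                best[sid] = (score, hyp)
        return {sid: hyp for sid, (score, hyp) in best.items()}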
    </Section>
  </Section>
</Paper>