File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/e06-1004_metho.xml
Size: 21,375 bytes
Last Modified: 2025-10-06 14:10:05
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1004"> <Title>Computational Complexity of Statistical Machine Translation</Title> <Section position="3" start_page="25" end_page="25" type="metho"> <SectionTitle> * Viterbi Alignment </SectionTitle> <Paragraph position="0"> Given the model parameters and a sentence pair (f,e), determine the most probable alignment between f and e.</Paragraph> <Paragraph position="2"> This forms the core of model training via the EM algorithm. Please see Section 2.3 for a description of the computational task involved in the EM iterations.</Paragraph> </Section> <Section position="4" start_page="25" end_page="26" type="metho"> <SectionTitle> * Conditional Probability </SectionTitle> <Paragraph position="0"> Given the model parameters and a sentence pair (f,e), compute P(f|e).</Paragraph> <Paragraph position="2"> Giventhe model parameters and asentencef, determine the most probable translation of f.</Paragraph> <Paragraph position="4"> Given the model parameters and asentencef, determine the most probable translation and alignment pair for f.</Paragraph> <Paragraph position="6"> Viterbi Alignment computation finds applications not only in SMT but also in other areas of Natural Language Processing (Wang, 1998), (Marcu, 2002). Expectation Evaluation is the soul of parameter estimation (Brown et al., 1993), (Al-Onaizan et al., 1999). Conditional Probability computation is important in experimentally studying the concentration of the probability mass around the Viterbi alignment, i.e. in determining the goodness of the Viterbi alignment in comparison to the rest of the alignments. Decoding is an integral component of all SMT systems (Wang, 1997), (Tillman, 2000), (Och et al., 2001), (Germann et al., 2003), (Udupa et al., 2004). Exact Decoding is the original decoding problem as defined in (Brown et al., 1993) and Relaxed Decoding is the relaxation of the decoding problem typically used in practice.</Paragraph> <Paragraph position="7"> While several heuristics have been developed by practitioners of SMT for the computational tasks involving IBM models, not much is known about the computational complexity ofthese tasks.</Paragraph> <Paragraph position="8"> In their seminal paper on SMT, Brownand his colleagues highlighted the problems weface aswe go from IBM Models 1-2 to 3-5(Brown et al., 1993) &quot;Asweprogress from Model1toModel5, evaluating the expectations that gives us counts becomes increasingly difficult. In Models 3 and 4, we must be content with approximate EM iterations because it is not feasible to carry out sums over all possible alignments for these models. In practice, we are never sure that we have found the Viterbi alignment&quot;.</Paragraph> <Paragraph position="9"> However, neither their work nor the subsequent research in SMT studied the computational complexity of these fundamental problems with the exception of the Decoding problem. In (Knight, 1999) it was proved that the Exact Decoding problem isNP-Hard when the language model is a bi-gram model.</Paragraph> <Paragraph position="10"> Our results may be summarized as follows: 1. ViterbiAlignmentcomputation isNP-Hard for IBM Models 3, 4, and 5.</Paragraph> <Paragraph position="11"> 2. Expectation Evaluation in EM Iterations is #P-Complete for IBM Models 3, 4, and 5.</Paragraph> <Paragraph position="12"> 3. Conditional Probability computation is #P-Complete for IBM Models 3, 4, and 5.</Paragraph> <Paragraph position="13"> 4. 
<Paragraph position="15"> Note that our results for decoding are sharper than those of (Knight, 1999). Firstly, we show that Exact Decoding is #P-Hard for IBM Models 3-5, and not just NP-Hard. Secondly, we show that Relaxed Decoding is NP-Hard for Models 3-5 even when the language model is a uniform distribution.</Paragraph> <Paragraph position="16"> The rest of the paper is organized as follows. We formally define all the problems discussed in the paper (Section 2). Next, we take up each of the problems discussed in this section and derive the stated result for it (Section 3). After this, we discuss the implications of our results (Section 4) and suggest future directions (Section 5).</Paragraph> </Section> <Section position="5" start_page="26" end_page="27" type="metho"> <SectionTitle> 2 Problem Definition </SectionTitle> <Paragraph position="0"> Consider functions f, g : Σ* → {0,1}. We say that g ≤_mp f (g is polynomial-time many-one reducible to f) if there exists a polynomial-time reduction r(·) such that g(x) = f(r(x)) for all input instances x ∈ Σ*. This means that, given a machine to evaluate f(·) in polynomial time, there exists a machine that can evaluate g(·) in polynomial time. We say a function f is NP-Hard if all functions in NP are polynomial-time many-one reducible to f. If, in addition, f ∈ NP, then we say that f is NP-Complete.</Paragraph> <Paragraph position="1"> Also relevant to our work are counting functions that answer queries such as &quot;how many computation paths exist for accepting a particular instance of input?&quot; Let w be a witness for the acceptance of an input instance x and χ(x,w) be a polynomial-time witness-checking function (i.e., χ(x,w) ∈ P). The function f : Σ* → N such that $$f(x) = \bigl|\{\, w : |w| \le p(|x|) \text{ and } \chi(x,w) = 1 \,\}\bigr|$$ lies in the class #P, where p(·) is a polynomial.</Paragraph> <Paragraph position="5"> Given functions f, g : Σ* → N, we say that g is polynomial-time Turing reducible to f (i.e., g ≤_T f) if there is a Turing machine with an oracle for f that computes g in time polynomial in the size of the input. Similarly, we say that f is #P-Hard if every function in #P can be polynomial-time Turing reduced to f. If f is #P-Hard and is in #P, then we say that f is #P-Complete.</Paragraph> <Section position="1" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 2.1 Viterbi Alignment Computation </SectionTitle> <Paragraph position="0"> VITERBI-3 is defined as follows. Given the parameters of IBM Model 3 and a sentence pair (f,e), compute the most probable alignment a* between f and e: $$a^* = \operatorname*{argmax}_{a} P(\mathbf{f}, a \mid \mathbf{e}).$$</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 2.2 Conditional Probability Computation </SectionTitle> <Paragraph position="0"> PROBABILITY-3 is defined as follows. Given the parameters of IBM Model 3 and a sentence pair (f,e), compute the probability $$P(\mathbf{f} \mid \mathbf{e}) = \sum_{a} P(\mathbf{f}, a \mid \mathbf{e}).$$</Paragraph> </Section>
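VITERBI-3 and PROBABILITY-3 differ only in whether the (l+1)^m alignment scores are maximized or summed. A naive exact solver, shown below purely to make the exponential cost concrete (it is not an algorithm from the paper), enumerates every alignment; `score` can be any Model 3 scorer, e.g. `model3_score` above with the parameters fixed.

```python
from itertools import product

def viterbi_and_probability(m, l, score):
    """Exact a* = argmax_a P(f,a|e) and P(f|e) = sum_a P(f,a|e)
    by enumerating all (l+1)^m alignments of an m-word f with an
    l-word e -- exponential in m, hence hopeless beyond toy sizes."""
    best_a, best_p, total = None, -1.0, 0.0
    for a in product(range(l + 1), repeat=m):  # all (l+1)^m alignments
        p = score(a)
        total += p
        if p > best_p:
            best_a, best_p = a, p
    return best_a, total
```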
<Section position="3" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 2.3 Expectation Evaluation in EM Iterations </SectionTitle> <Paragraph position="0"> (f,e)-COUNT-3, (φ,e)-COUNT-3, (j,i,m,l)-COUNT-3, 0-COUNT-3, and 1-COUNT-3 are defined respectively as follows. Given the parameters of IBM Model 3 and a sentence pair (f,e), compute the following counts (Brown et al., 1993): $$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{a} P(a \mid \mathbf{f}, \mathbf{e}) \sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j}),$$ $$c(\phi \mid e; \mathbf{f}, \mathbf{e}) = \sum_{a} P(a \mid \mathbf{f}, \mathbf{e}) \sum_{i=1}^{l} \delta(\phi, \phi_i)\,\delta(e, e_i),$$ $$c(j \mid i, m, l; \mathbf{f}, \mathbf{e}) = \sum_{a} P(a \mid \mathbf{f}, \mathbf{e})\,\delta(i, a_j),$$ $$c(0; \mathbf{f}, \mathbf{e}) = \sum_{a} P(a \mid \mathbf{f}, \mathbf{e})\,(m - 2\phi_0), \qquad c(1; \mathbf{f}, \mathbf{e}) = \sum_{a} P(a \mid \mathbf{f}, \mathbf{e})\,\phi_0.$$</Paragraph> </Section> <Section position="4" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 2.4 Decoding </SectionTitle> <Paragraph position="0"> E-DECODING-3 and R-DECODING-3 are defined as follows. Given the parameters of IBM Model 3 and a sentence f, compute its most probable translation according to the following equations respectively: $$\mathbf{e}^* = \operatorname*{argmax}_{\mathbf{e}} P(\mathbf{e}) \sum_{a} P(\mathbf{f}, a \mid \mathbf{e}) \qquad \text{and} \qquad (\mathbf{e}^*, a^*) = \operatorname*{argmax}_{(\mathbf{e},\, a)} P(\mathbf{e})\, P(\mathbf{f}, a \mid \mathbf{e}).$$</Paragraph> </Section> <Section position="5" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 2.5 SETCOVER </SectionTitle> <Paragraph position="0"> Given a collection of sets C = {S_1,...,S_l} and a set X ⊆ ∪_{i=1}^{l} S_i, find the minimum-cardinality subset C′ of C such that every element of X belongs to at least one member of C′.</Paragraph> <Paragraph position="1"> SETCOVER is a well-known NP-Complete problem. If SETCOVER ≤_mp f, then f is NP-Hard.</Paragraph> </Section> <Section position="6" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 2.6 PERMANENT </SectionTitle> <Paragraph position="0"> Given a matrix M = [M_{j,i}]_{n×n} whose entries are either 0 or 1, compute $$\mathrm{perm}(M) = \sum_{\pi} \prod_{j=1}^{n} M_{j,\pi(j)},$$ where π ranges over the permutations of 1,...,n.</Paragraph> <Paragraph position="1"> This problem is the same as that of counting the number of perfect matchings in a bipartite graph and is known to be #P-Complete (Valiant, 1979). If PERMANENT ≤_T f, then f is #P-Hard.</Paragraph> </Section> <Section position="7" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 2.7 COMPAREPERMANENTS </SectionTitle> <Paragraph position="0"> Given two matrices A = [A_{j,i}]_{n×n} and B = [B_{j,i}]_{n×n} whose entries are either 0 or 1, determine which of them has the larger permanent. PERMANENT is known to be Turing reducible to COMPAREPERMANENTS (Jerrum, 2005); therefore, if COMPAREPERMANENTS ≤_T f, then f is #P-Hard.</Paragraph> </Section> </Section> <Section position="6" start_page="27" end_page="30" type="metho"> <SectionTitle> 3 Main Results </SectionTitle> <Paragraph position="0"> In this section, we present the main reductions for the problems with Model 3 as the translation model. Our reductions can be easily carried over to Models 4-5 with minor modifications. In order to keep the presentation of the main ideas simple, we let the lexicon, distortion, and fertility models be any non-negative functions, and not just probability distributions, in our reductions.</Paragraph> <Section position="1" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 3.1 VITERBI-3 </SectionTitle> <Paragraph position="0"> We show that VITERBI-3 is NP-Hard.</Paragraph> <Paragraph position="1"> Lemma 1 SETCOVER ≤_mp VITERBI-3.</Paragraph> <Paragraph position="2"> Proof: We give a polynomial-time many-one reduction from SETCOVER to VITERBI-3. Given a collection of sets C = {S_1,...,S_l} and a set X ⊆ ∪_{i=1}^{l} S_i, we create an instance of VITERBI-3 as follows. For each set S_i ∈ C, we create a word e_i (1 ≤ i ≤ l). Similarly, for each element v_j ∈ X we create a word f_j (1 ≤ j ≤ |X| = m). We set the model parameters as follows:</Paragraph> <Paragraph position="4"/> <Paragraph position="5"> We can construct a cover for X from the output of VITERBI-3 by defining C′ = {S_i | φ_i > 0}.</Paragraph> <Paragraph position="6"/> <Paragraph position="7"> Therefore, the Viterbi alignment yields a minimum cover for X.</Paragraph> </Section>
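A sketch of the reduction in Lemma 1 follows. The parameter values are an illustrative choice of ours, since the exact settings are lost in this copy of the paper: taking n(φ|e_i) = (1/2)/φ! for φ ≥ 1 cancels the φ! factor in the Model 3 score, so every alignment that covers all of X scores exactly 2^{−|C′|}, and the Viterbi alignment therefore uses the fewest distinct words, i.e. a minimum cover.

```python
import math

def setcover_to_viterbi3(sets, X):
    """Build a VITERBI-3 instance from a SETCOVER instance (cf. Lemma 1).
    sets : list of Python sets S_1..S_l;  X : set of elements to cover."""
    l, m = len(sets), len(X)
    e = [f"e{i}" for i in range(1, l + 1)]   # one English word per set S_i
    f = [f"f{j}" for j in range(1, m + 1)]   # one French word per element v_j
    elems = sorted(X)
    # t(f_j | e_i) = 1 iff v_j is in S_i; all other lexicon entries are 0
    t = {(f[j], e[i]): 1.0
         for i in range(l) for j in range(m) if elems[j] in sets[i]}
    # uniform distortion; fertility n(phi|e_i) = (1/2)/phi! for phi >= 1
    d = {(j, i, m, l): 1.0 for j in range(1, m + 1) for i in range(1, l + 1)}
    n = {(phi, w): (1.0 if phi == 0 else 0.5 / math.factorial(phi))
         for w in e for phi in range(m + 1)}
    return f, e, t, d, n, 0.0                # p1 = 0: no NULL-generated words

def cover_from_alignment(a):
    """C' = {S_i : phi_i > 0}, returned as indices of the chosen sets."""
    return {i for i in set(a) if i != 0}
```

With these parameters, `model3_score` from the earlier sketch evaluates any covering alignment to 2^{−|C′|}, so running the brute-force `viterbi_and_probability` on a tiny instance recovers a minimum cover.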
<Section position="2" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 3.2 PROBABILITY-3 </SectionTitle> <Paragraph position="0"> We show that PROBABILITY-3 is #P-Complete. We begin by proving the following: Lemma 2 PERMANENT ≤_T PROBABILITY-3.</Paragraph> <Paragraph position="2"> Proof: Given a 0-1 matrix M = [M_{j,i}]_{n×n}, we define f = f_1 ... f_n and e = e_1 ... e_n, where each e_i and f_j is distinct, and set the Model 3 parameters as follows: t(f_j | e_i) = M_{j,i}, d(j | i, n, n) = 1, n(1 | e_i) = 1, and p_1 = 0, with all remaining fertility parameters set to 0. Clearly, with the above parameter setting, $$P(\mathbf{f} \mid \mathbf{e}) = \sum_{a} P(\mathbf{f}, a \mid \mathbf{e}) = \sum_{\pi} \prod_{j=1}^{n} M_{j,\pi(j)} = \mathrm{perm}(M).$$ Thus, by construction, PROBABILITY-3 computes perm(M). Besides, the construction conserves the number of witnesses. Hence, PERMANENT ≤_T PROBABILITY-3.</Paragraph> <Paragraph position="7"> We now prove that Lemma 3 PROBABILITY-3 is in #P.</Paragraph> <Paragraph position="8"> Proof: Let (f,e) be the input to PROBABILITY-3, and let m and l be the lengths of f and e respectively. With each alignment a = (a_1,a_2,...,a_m) we associate a unique number n_a = a_1 a_2 ... a_m in base l+1. Clearly, 0 ≤ n_a ≤ (l+1)^m − 1. Let w be the binary encoding of n_a. Conversely, with every binary string w we can associate an alignment a if the value of w is in the range 0,...,(l+1)^m − 1. It requires O(m log(l+1)) bits to encode an alignment. Thus, given an alignment we can compute its encoding, and given the encoding we can compute the corresponding alignment, in time polynomial in l and m. Similarly, given an encoding, we can compute P(f,a|e) in time polynomial in l and m. Now, if p(·) is a polynomial, then the function</Paragraph> <Paragraph position="10"/> <Paragraph position="11"> is in #P. Choose p(x) = ⌈x log₂(x+1)⌉. Clearly, all alignments can be encoded using at most p(|(f,e)|) bits. Therefore, f(f,e) computes P(f|e), and hence PROBABILITY-3 is in #P.</Paragraph> <Paragraph position="12"> It follows immediately from Lemma 2 and Lemma 3 that Theorem 1 PROBABILITY-3 is #P-Complete.</Paragraph> </Section> <Section position="3" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 3.3 (f,e)-COUNT-3 </SectionTitle> <Paragraph position="0"> Lemma 4 PERMANENT ≤_T (f,e)-COUNT-3.</Paragraph> <Paragraph position="1"> Proof: The proof is similar to that of Lemma 2. Let f = f_1 f_2 ... f_n f̂ and e = e_1 e_2 ... e_n ê, and set</Paragraph> <Paragraph position="3"/> <Paragraph position="4"> The rest of the parameters are set as in Lemma 2. Let A be the set of alignments a such that a_{n+1} = n+1 and a_1^n is a permutation of 1,2,...,n. Now,</Paragraph> <Paragraph position="5"/> <Paragraph position="6"> Therefore, PERMANENT ≤_T (f,e)-COUNT-3.</Paragraph> <Paragraph position="7"> Lemma 5 (f,e)-COUNT-3 is in #P.</Paragraph> <Paragraph position="8"> Proof: The proof is essentially the same as that of Lemma 3. Note that, given an encoding w, $P(\mathbf{f}, a \mid \mathbf{e}) \sum_{j=1}^{m} \delta(f_j, f)\,\delta(e_{a_j}, e)$ can be evaluated in time polynomial in |(f,e)|.</Paragraph> <Paragraph position="9"> Hence, from Lemma 4 and Lemma 5, it follows that Theorem 2 (f,e)-COUNT-3 is #P-Complete. 3.4 (j,i,m,l)-COUNT-3. Lemma 6 PERMANENT ≤_T (j,i,m,l)-COUNT-3.</Paragraph> <Paragraph position="10"> Proof: We proceed as in the proof of Lemma 4, with some modifications. Let e = e_1 ... e_{i−1} ê e_i ... e_n and</Paragraph> <Paragraph position="11"/> <Paragraph position="12"> The rest of the parameters are set as in Lemma 4. Let A be the set of alignments a such that a is a permutation of 1,2,...,(n+1) and a_j = i. Observe that P(f,a|e) is non-zero only for the alignments in A. It follows immediately that, with these parameter settings, c(j|i,n,n;f,e) = perm(M).</Paragraph>
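The #P-membership arguments (Lemmas 3, 5, 7, and their kin) all rest on the bijection between alignments and integers written in base l+1. A small sketch of that encoding, with hypothetical helper names not taken from the paper:

```python
def encode_alignment(a, l):
    """Map alignment a = (a_1,...,a_m), 0 <= a_j <= l, to the integer
    n_a = a_1 a_2 ... a_m in base l+1 (cf. Lemma 3): O(m log(l+1)) bits."""
    n = 0
    for aj in a:
        n = n * (l + 1) + aj
    return n

def decode_alignment(n, l, m):
    """Inverse map: recover the alignment from its integer code."""
    a = []
    for _ in range(m):
        a.append(n % (l + 1))
        n //= l + 1
    return a[::-1]
```

Both directions run in time polynomial in l and m, which is exactly what the membership proofs require.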
<Paragraph position="13"> Lemma 7 (j,i,m,l)-COUNT-3 is in #P.</Paragraph> <Paragraph position="14"> Proof: Similar to the proof of Lemma 5.</Paragraph> <Paragraph position="15"> Theorem 3 (j,i,m,l)-COUNT-3 is #P-Complete. 3.5 (φ,e)-COUNT-3. Lemma 8 PERMANENT ≤_T (φ,e)-COUNT-3. Proof: Let f = f_1 f_2 ... f_n f̂ ... f̂. Let A be the set of alignments for which a_1^n is a permutation of 1,2,...,n and</Paragraph> <Paragraph position="16"/> <Paragraph position="17"> The rest of the parameters are set as in Lemma 4. Note that P(f,a|e) is non-zero only for the alignments in A. It follows immediately that, with these parameter settings, c(k|ê;f,e) = perm(M).</Paragraph> <Paragraph position="18"> Lemma 9 (φ,e)-COUNT-3 is in #P.</Paragraph> <Paragraph position="19"> Proof: Similar to the proof of Lemma 5.</Paragraph> <Paragraph position="20"> Theorem 4 (φ,e)-COUNT-3 is #P-Complete.</Paragraph> </Section> <Section position="4" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 3.6 0-COUNT-3 </SectionTitle> <Paragraph position="0"> Lemma 10 PERMANENT ≤_T 0-COUNT-3.</Paragraph> <Paragraph position="1"> Proof: Let e = e_1 ... e_n and f = f_1 ... f_n f̂. Let A be the set of alignments a such that a_1^n is a permutation of 1,...,n and a_{n+1} = 0. We set</Paragraph> <Paragraph position="2"/> <Paragraph position="3"> The rest of the parameters are set as in Lemma 4. It is easy to see that with these settings, c(0;f,e)(n−2) = perm(M).</Paragraph> <Paragraph position="5"> Lemma 11 0-COUNT-3 is in #P. Proof: Similar to the proof of Lemma 5. Theorem 5 0-COUNT-3 is #P-Complete. 3.7 1-COUNT-3. Lemma 12 PERMANENT ≤_T 1-COUNT-3. Proof: We set the parameters as in Lemma 10. It follows immediately that c(1;f,e) = perm(M).</Paragraph> <Paragraph position="7"> Lemma 13 1-COUNT-3 is in #P.</Paragraph> <Paragraph position="8"> Proof: Similar to the proof of Lemma 5. Theorem 6 1-COUNT-3 is #P-Complete. 3.8 E-DECODING-3. Lemma 14 COMPAREPERMANENTS ≤_T E-DECODING-3.</Paragraph> <Paragraph position="9"> Proof: Let M and N be the two 0-1 matrices. Let f = f_1 f_2 ... f_n, e^(1) = e^(1)_1 e^(1)_2 ... e^(1)_n, and e^(2) = e^(2)_1 e^(2)_2 ... e^(2)_n. Further, let e^(1) and e^(2) have no words in common, and let each word appear exactly once. By setting the language model probabilities of the bigrams that occur in e^(1) and e^(2) to 1 and all other bigram probabilities to 0, we can ensure that the only translations considered by E-DECODING-3 are indeed e^(1) and e^(2), and that P(e^(1)) = P(e^(2)) = 1. We then set</Paragraph> <Paragraph position="10"/> <Paragraph position="11"> so that P(f|e^(1)) = perm(M) and P(f|e^(2)) = perm(N). Therefore, given the output of E-DECODING-3, we can find out which of M and N has the larger permanent. Hence, E-DECODING-3 is #P-Hard.</Paragraph> </Section>
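For reference, the quantity every reduction in this section targets can be computed directly, though only in O(n · n!) time. The brute-force routine below (a toy illustration, not from the paper) also yields COMPAREPERMANENTS, the single bit that Lemma 14 extracts from one call to an exact decoder.

```python
import math
from itertools import permutations

def permanent(M):
    """perm(M) = sum over permutations pi of prod_j M[j][pi[j]] --
    the #P-Complete quantity of Valiant (1979); n! terms, toy sizes only."""
    n = len(M)
    return sum(math.prod(M[j][pi[j]] for j in range(n))
               for pi in permutations(range(n)))

def compare_permanents(A, B):
    """COMPAREPERMANENTS: report which matrix has the larger permanent."""
    return "A" if permanent(A) > permanent(B) else "B"
```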
<Section position="5" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 3.9 R-DECODING-3 </SectionTitle> <Paragraph position="0"> Lemma 15 SETCOVER ≤_mp R-DECODING-3.</Paragraph> <Paragraph position="1"> Proof: Given an instance of SETCOVER, we set the parameters as in the proof of Lemma 1, with the following modification:</Paragraph> <Paragraph position="2"/> <Paragraph position="3"> Let e be the optimal translation obtained by solving R-DECODING-3. As the language model is uniform, the exact order of the words in e is not important. Now, we observe the following: * e contains words only from the set {e_1,e_2,...,e_l}. This is because there cannot be any zero-fertility word, as n(0|e) = 0, and the only words that can have a non-zero fertility are from {e_1,e_2,...,e_l}, due to the way we have set the lexicon parameters. * No word occurs more than once in e. Assume on the contrary that the word e_i occurs k > 1 times in e. Replace these k occurrences by a single occurrence of e_i and connect all the words connected to them to this single occurrence. This would increase the score of e by a factor of 2^{k−1} > 1, contradicting the optimality of e.</Paragraph> <Paragraph position="4"> As a result, the only candidates for e are subsets of {e_1,e_2,...,e_l}, in any order. It is now straightforward to verify that a minimum set cover can be recovered from e, as shown in the proof of Lemma 1. 3.10 IBM Models 4 and 5. The reductions for Model 3 can be easily extended to Models 4 and 5; thus, the corresponding hardness results hold for those models as well.</Paragraph> </Section> </Section> <Section position="7" start_page="30" end_page="30" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> Our results answer several open questions on the computation of Viterbi Alignment and Expectation Evaluation. Unless P = NP and P^{#P} = P, there can be no polynomial-time algorithms for either of these problems. The evaluation of expectations becomes increasingly difficult as we go from IBM Models 1-2 to Models 3-5 exactly because the problem is #P-Complete for the latter models. There cannot be any trick for IBM Models 3-5 that would help us carry out the sums over all possible alignments exactly, and there cannot exist a closed-form expression (whose representation is polynomial in the size of the input) for P(f|e) and the counts in the EM iterations for Models 3-5.</Paragraph> <Paragraph position="1"> It should be noted that the computation of Viterbi Alignment and Expectation Evaluation is easy for Models 1-2. What makes these computations hard for Models 3-5? To answer this question, we observe that Models 1-2, unlike Models 3-5, lack an explicit fertility model. In the former models, fertility probabilities are determined by the lexicon and alignment models, whereas in Models 3-5 the fertility model is independent of the lexicon and alignment models. It is precisely this freedom that makes computations on Models 3-5 harder than computations on Models 1-2.</Paragraph> <Paragraph position="2"> There are three different ways of dealing with the computational barrier posed by our problems. The first is to develop a restricted fertility model that permits polynomial-time computations; it remains to be found what kind of parameterized distributions are suitable for this purpose. The second approach is to develop provably good approximation algorithms for these problems, as is done with many NP-Hard and #P-Hard problems. Provably good approximation algorithms exist for several covering problems, including Set Cover and Vertex Cover. Viterbi Alignment is itself a special type of covering problem, and it remains to be seen whether some of the techniques developed for covering algorithms are useful for finding good approximations to Viterbi Alignment. Similarly, there exist several techniques for approximating the permanent of a matrix. It needs to be explored whether some of these ideas can be adapted for Expectation Evaluation.</Paragraph>
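As one illustration of the permanent-approximation direction (our sketch, not a method proposed in the paper), a naive Monte Carlo estimator uses the fact that for a uniformly random permutation σ, E[∏_j M_{j,σ(j)}] = perm(M)/n!. It is unbiased but has enormous variance on sparse 0-1 matrices; practical approximation schemes are far more refined.

```python
import math
import random

def mc_permanent(M, samples=100_000, rng=random):
    """Unbiased Monte Carlo estimate of perm(M):
    n! times the sample mean of prod_j M[j][sigma[j]] over uniform sigma."""
    n = len(M)
    idx = list(range(n))
    acc = 0.0
    for _ in range(samples):
        rng.shuffle(idx)                 # uniform random permutation
        prod = 1.0
        for j in range(n):
            prod *= M[j][idx[j]]
            if prod == 0.0:              # early exit on a zero entry
                break
        acc += prod
    return math.factorial(n) * acc / samples
```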
<Paragraph position="4"> As the third approach to dealing with the complexity, we can approximate the space of all (l+1)^m possible alignments by an exponentially large subspace. To be useful, such large subspaces should also admit optimal polynomial-time algorithms for the problems we have discussed in this paper. This is exactly the approach taken by (Udupa, 2005) for solving the decoding and Viterbi alignment problems. They show that very efficient polynomial-time algorithms can be developed for both the Decoding and the Viterbi Alignment problems. Not only are the algorithms provably superior in a computational-complexity sense, but (Udupa, 2005) also report substantial improvements in BLEU and NIST scores over the Greedy decoder.</Paragraph> </Section> </Paper>