Squibs and Discussions 
Decoding Complexity in Word-Replacement 
Translation Models 
Kevin Knight* 
University of Southern California 
Statistical machine translation is a relatively new approach to the long-standing problem of trans- 
lating human languages by computer. Current statistical techniques uncover translation rules 
from bilingual training texts and use those rules to translate new texts. The general architecture 
is the source-channel model: an English string is statistically generated (source), then statistically 
transformed into French (channel). In order to translate (or "decode") a French string, we look 
for the most likely English source. We show that for the simplest form of statistical models, this 
problem is NP-complete, i.e., probably exponential in the length of the observed sentence. We 
trace this complexity to factors not present in other decoding problems. 
1. Introduction 
Statistical models are widely used in attacking natural language problems. The source- 
channel framework is especially popular, finding applications in part-of-speech tag- 
ging, accent restoration, transliteration, speech recognition, and many other areas. In 
this framework, we build an underspecified model of how certain structures (such as 
strings) are generated and transformed. We then instantiate the model through training 
on a database of sample structures and transformations. 
Recently, Brown et al. (1993) built a source-channel model of translation between 
English and French. They assumed that English strings are produced according to some 
stochastic process (source model) and transformed stochastically into French strings 
(channel model). To translate French to English, it is necessary to find an English 
source string that is likely according to the models. With a nod to its cryptographic 
antecedents, this kind of translation is called decoding. This paper looks at decoding 
complexity. 
2. Part-of-Speech Tagging 
The prototype source-channel application in natural language is part-of-speech tagging 
(Church 1988). We review it here for purposes of comparison with machine translation. 
Source strings comprise sequences of part-of-speech tags like noun, verb, etc. A 
simple source model assigns a probability to a tag sequence tl .. •tm based on the prob- 
abilities of the tag pairs inside it. Target strings are English sentences, e.g., wl ... win. 
The channel model assumes each tag is probabilistically replaced by a word (e.g., noun 
by dog) without considering context. More concretely, we have: 
• v total tags 
• A bigram source model with v 2 parameters of the form b(t\]t), where 
P(tl... tin) "" b(tllboundary) • b(t2\]tl) ..... b(tn\]tm-1) " b(boundary\]tm) 
• Information Sciences Institute, Marina del Rey, CA 90292 
@ 1999 Association for Computational Linguistics 
Computational Linguistics Volume 25, Number 4 
• A substitution channel model with parameters of the form s(w\]t), where 
P(wl ... Wmlh... tm) ~ S(Wllh)" S(W21t2)" ..." S(Wraltm) 
• an m-word text annotated with correct tags 
• an m-word unannotated text 
We can assign parts-of-speech to a previously unseen word sequence wl... Wm 
by finding the sequence tl... tm that maximizes P(h... tmlWl... Wm). By Bayes' rule, 
we can equivalently maximize P(h ... tm)'P(wl.., wmlh.., tin), which we can calculate 
directly from the b and s tables above. 
Three interesting complexity problems in the source-channel framework are: 
• Can parameter values be induced from annotated text efficiently? 
• Can optimal decodings be produced efficiently? 
• Can parameter values be induced from unannotated text efficiently? 
The first problem is solved in O(m) time for part-of-speech tagging--we simply 
count tag pairs and word/tag pairs, then normalize. The second problem seems to 
require enumerating all O(v m) potential source sequences to find the best, but can 
actually be solved in O(mv 2) time with dynamic programming. We turn to the third 
problem in the context of another application: cryptanalysis. 
3. Substitution Ciphers 
In a substitution cipher, a plaintext message like HELLO WORLD is transformed into 
a ciphertext message like EOPPX YXAPF via a fixed letter-substitution table. As with 
tagging, we can assume an alphabet of v source tokens, a bigram source model, a 
substitution channel model, and an m-token coded text. 
If the coded text is annotated with corresponding English, then building source 
and channel models is trivially O(m). Comparing the situation to part-of-speech tag- 
ging: 
• (Bad news.) Cryptanalysts rarely get such coded/decoded text pairs and 
must employ "ciphertext-only" attacks using unannotated training data. 
• (Good news.) It is easy to train a source model separately, on raw 
unannotated English text that is unconnected to the ciphertext. 
Then the problem becomes one of acquiring a channel model, i.e., a table s(fle ) with 
an entry for each code-letter/plaintext-letter pair. Starting with an initially uniform 
table, we can use the estimation-maximization (EM) algorithm to iteratively revise 
s(fle ) so as to increase the probability of the observed corpus P(f). Figure 1 shows a 
naive EM implementation that runs in O(mv m) time. There is an efficient O(mv 2) EM 
implementation based on dynamic programming that accomplishes the same thing. 
Once the s(fle ) table has been learned, there is a similar O(mv 2) algorithm for optimal 
decoding. Such methods can break English letter-substitution ciphers of moderate 
size. 
608 
Knight Decoding Complexity 
Given coded text f of length m, a plaintext vocabulary of v tokens, and a source model b: 
1. set the s0Cle) table initially to be uniform 
2. for several iterations do: 
a, 
b. 
C. 
d. 
set up a count table c0CI e) with zero entries 
P(f) = 0 
for all possible source texts el... em (el drawn from vocabulary) 
compute P(e) = b(ell boundary), b(boundary lem). \[Ii~=2 b(eilei_l) 
m compute P(fle) = I~j=l s(fjleJ) 
P(f) += P(e). P(fle) 
for all source texts e of length m 
compute P(elf ) = P(e)'P(fle) P(f) 
for j = 1 to m 
c0~lej) += P(e~) 
normalize c0Ci e) table to create a revised s0CI e) 
Figure 1 
A naive application of the EM algorithm to break a substitution cipher. It runs in O(mv m) time. 
4. Machine Translation 
In our discussion of substitution ciphers, we were on relatively sure ground the 
channel model we assumed in decoding is actually the same one used by the cipher 
writer for encoding. That is, we know that plaintext is converted to ciphertext, letter by 
letter, according to some table. We have no such clear conception about how English 
gets converted to French, although many theories exist. Brown et al. (1993) recently cast 
some simple theories into a source-channel framework, using the bilingual Canadian 
parliament proceedings as training data. We may assume: 
• v total English words. 
• A bigram source model with V 2 parameters. 
• Various substitution/permutation channel models. 
• A collection of bilingual sentence pairs (sentence lengths < m). 
• A collection of monolingual French sentences (sentence lengths < m). 
Bilingual texts seem to exhibit English words getting substituted with French ones, 
though not one-for-one and not without changing their order. These are important 
departures from the two applications discussed earlier. 
In the main channel model of Brown et al. (1993), each English word token ei 
in a source sentence is assigned a "fertility" @, which dictates how many French 
words it will produce. These assignments are made stochastically according to a table 
n(~le ). Then actual French words are produced according to s(fie ) and permuted into 
new positions according to a distortion table d(jli, m, 1). Here, j and i are absolute tar- 
get/source word positions within a sentence, and m and I are target/source sentence 
lengths. 
Inducing n, s, and d parameter estimates is easy if we are given annotations in the 
form of word alignments. An alignment is a set of connections between English and 
French words in a sentence pair. In Brown et al. (1993), aligrtrnents are asymmetric-- 
each French word is connected to exactly one English word. 
609 
Computational Linguistics Volume 25, Number 4 
Given a collection of sentence pairs: 
1. collect estimates for the ~(m\]l) table directly from the data 
2. set the s0e\]e) table initially to be uniform 
3. for several iterations do: 
a. 
b. 
C. 
set up a count table c(f\]e) with zero entries 
for each given sentence pair e, f with respective lengths I, m: 
foral=ltol 
for a2 = 1 to 1 /* select connections for a word alignment */ 
for am = 1 to l 
compute P(al ...... \]e, f) - p(f' al ...... \]e) P(f\]e) 
for j = 1 to m 
c0~l%) += P(al... amle, f) 
normalize c0~\]ei ) table to create new s(fi\]ei) 
m 1-Ij=, s~l%) 
G'o; =, ' m " • ~,,-=, I-\[j=, s~le,;) 
Figure 2 
Naive EM training for the Model 1 channel model. 
Word-aligned data is usually not available, but large sets of unaligned bilin- 
gual sentence pairs do sometimes exist. A single sentence pair will have \[m possible 
alignments--for each French word position 1... m, there is a choice of I English po- 
sitions to connect to. A naive EM implementation will collect n, s, and d counts by 
considering each alignment, but this is expensive. (By contrast, part-of-speech tagging 
involves a single alignment, leading to O(m) training). Lacking a polynomial refor- 
mulation, Brown et al. (1993) decided to collect counts only over a subset of likely 
alignments. To bootstrap, they required some initial idea of what alignments are rea- 
sonable, so they began with several iterations of a simpler channel model (called 
Model 1) that has nicer computational properties. 
In the following description of Model 1, we represent an aligmnent formally as a 
vector al ..... am, with values aj ranging over English word positions 1... I. 
Model 1 Channel 
Parameters: c(mll ) and s(f\[e). 
Given a source sentence e of length I: 
1. choose a target sentence length m according to ¢(mll ) 
2. for j = 1 to m, choose an English word position aj according to the 
uniform distribution over 1... l 
3. for j = 1 to m, choose a French word j~ according to s~\]%) 
4. read off fl ...fro as the target sentence 
Because the same e may produce the same f by means of many different align- 
ments, we must sum over all of them to obtain P(fle): 
1 l 1 l m P(fl e) = c(mll) T~ Y~al=l ~a2=l """ Y~am=l I\]j=l s(fjleai) 
Figure 2 illustrates naive EM training for Model 1. If we compute P(fle) once per 
iteration, outside the "for a" loops, then the complexity is O(ml m) per sentence pair, 
per iteration. 
610 
Knight Decoding Complexity 
More efficient O(lm) training was devised by Brown et al. (1993). Instead of pro- 
cessing each alignment separately, they modified the algorithm in Figure 2 as follows: 
b. for each given sentence pair e, f of respective lengths l, m: 
for j = 1 to m 
sum = 0 
for i = 1 to I 
sum += s(fjlei) 
for i = 1 to I 
c(fjlei ) += s(fjlei ) / sum 
This works because of the algebraic trick that the portion of P(fle) we originally wrote 
1 1 m e m as ~al=," "" Y~am=l 1-Ij=l S(J~\[ aj) can be rewritten as YIj=I Y~I=I s(fjlei)" 
We next consider decoding. We seek a string e that maximizes P(elf), or equiva- 
lently maximizes P(e) • P(fle). A naive algorithm would evaluate all possible source 
strings, whose lengths are potentially unbounded. If we limit our search to strings 
at most twice the length m of our observed French, then we have a naive O(m2v 2m) 
method: 
Given a string f of length m 
1. for all source strings e of length I _ 2m: 
a. compute P(e) = b(el I boundary) - b(boundary Iet) " I-lli=2 b(eilei-1) 
m b. compute P(fle) = c(mll ) ~ l-\[j=1 ~1i=1 s(fjlei) 
c. compute P(elf) ,-~ P(e) • P(fle) 
d. if P(elf ) is the best so far, remember it 
2. print best e 
We may now hope to find a way of reorganizing this computation, using tricks like 
the ones above. Unfortunately, we are unlikely to succeed, as we now show. For 
proof purposes, we define our optimization problem with an associated yes-no decision 
problem: 
Definition: M1-OPTIMIZE 
Given a string f of length m and a set of parameter tables (b, e, s), return a string e of 
length I < 2m that maximizes P(elf), or equivalently maximizes 
1 P(e) - P(fle) = b(el I boundary) -b(boundary I el ) • 1-\[i=2 b(eilei-1) 
• c(mll ) ± vi m x-,! l m llj=l /'i=1 s(fjlei) 
Definition: M1-DECIDE 
Given a string f of length m, a set of parameter tables (b, e, s), and a real number k, 
does there exist a string e of length l < 2m such that P(e) • P(fle) > k? 
We will leave the relationship between these two problems somewhat open and 
intuitive, noting only that M1-DECIDE's intractability does not bode well for M1- 
OPTIMIZE. 
611 
Computational Linguistics Volume 25, Number 4 
Theorem 
M1-DECIDE is NP-complete. 
To show inclusion in NP, we need only nondeterministically choose e for any 
problem instance and verify that it has the requisite P(e) • P(fle) in O(m 2) time. Next 
we give separate polynomial-time reductions from two NP-complete problems. Each 
reduction highlights a different source of complexity. 
4.1 Reduction 1 (from Hamilton Circuit Problem) 
The Hamilton Circuit Problem asks: given a directed graph G with vertices labeled 
0,...,n, does G have a path that visits each vertex exactly once and returns to its 
starting point? We transform any Hamilton Circuit instance into an M1-DECIDE in- 
stance as follows. First, we create a French vocabulary fl ..... fn, associating word fi 
with vertex i in the graph. We create a slightly larger English vocabulary e0 ..... en, 
with e0 serving as the "boundary" word for source model scoring. Ultimately, we will 
ask M1-DECIDE to decode the string fl...fn. 
We create channel model tables as follows: 
s~.lei) = {10 ifi=j otherwise 
¢(mll) = {10 ifl=m 
otherwise 
These tables ensure that any decoding e off1 ...fn will contain the n words el .... , en 
(in some order). We now create a source model. For every pair (i,j) such that 0 G i,j G n: 
= ~l/n if graph G contains an edge from vertex i to vertex j 
b(ej\[ei) to 
otherwise 
Finally, we set k to zero. To solve a Hamilton Circuit Problem, we transform it as 
above (in quadratic time), then invoke M1-DECIDE with inputs b, c, s, k, and fl...fm. 
If M1-DECIDE returns yes, then there must be some string e with both P(e) and 
P(fle) nonzero. The channel model lets us conclude that if P(f\[e) is nonzero, then e 
contains the n words el,..., en in some order. If P(e) is nonzero, then every bigram in 
e (including the two boundary bigrams involving e0) has nonzero probability. Because 
each English word in e corresponds to a unique vertex, we can use the order of words 
in e to produce an ordering of vertices in G. We append vertex 0 to the beginning 
and end of this list to produce a Hamilton Circuit. The source model construction 
guarantees an edge between each vertex and the next. 
If M1-DECIDE returns no, then we know that every string e includes at least one 
zero value in the computation of either P(e) or P(fle). From any proposed Hamilton 
Circuit--i.e., some ordering of vertices in G--we can construct a string e using the 
same ordering. This e will have P(f\]e) = 1 according to the channel model. Therefore, 
P(e) = 0. By the source model, this can only happen if the proposed "circuit" is actually 
broken somewhere. So no Hamilton Circuit exists. 
Figure 3 illustrates the intuitive correspondence between selecting a good word 
order and finding a Hamilton Circuit. We note that Brew (1992) discusses the NP- 
completeness of a related problem, that of finding some permutation of a string that 
is acceptable to a given context-free grammar. Both of these results deal with decision 
problems. Returning to optimization, we recall another circuit task called the Traveling 
612 
Knight Decoding Complexity 
my 
b°uid~N ~ r ~'/~~ 
falls Thursday 
Figure 3 
Selecting a good source word order is like solving the Hamilton Circuit Problem. If we assume 
that the channel model offers deterministic, word-for-word translations, then the bigram 
source model takes responsibility for ordering them. Some word pairs in the source language 
may be illegal. In that case, finding a legal word ordering is like finding a complete circuit in a 
graph. (In the graph shown above, a sample circuit is boundary --, this ---* year ~ comma ~ my 
--* birthday --~ falls --~ on ---* a --+ Thursday ~ boundary). If word pairs have probabilities attached 
to them, then word ordering resembles the finding the least-cost circuit, also known as the 
Traveling Salesman Problem. 
Salesman Problem. It introduces edge costs dq and seeks a minimum-cost circuit. By 
viewing edge costs as log probabilities, we can cast the Traveling Salesman Problem 
as one of optimizing P(e), that is, of finding the best source word order in Model 1 
decoding. 
4.2 Reduction 2 (from Minimum Set Cover Problem) 
The Minimum Set Cover Problem asks: given a collection C of subsets of finite set S, 
and integer n, does C contain a cover for S of size ~ n, i.e., a subcollection whose 
union is S? We now transform any instance of Minimum Set Cover into an instance 
of M1-DECIDE, using polynomial time. This time, we assume a rather neutral source 
model in which all strings of a given length are equally likely, but we construct a more 
complex channel. 
We first create a source word ei for each subset in C, and let gi be the size of 
that subset. We create a table b(eilej) with values set uniformly to the reciprocal of the 
source vocabulary size (i.e., the number of subsets in C). 
Assuming S has m elements, we next create target words fl ..... fm corresponding 
to each of those elements, and set up channel model tables as follows: 
if the element in S corresponding to j~ is also in the subset 
corresponding to ei 
otherwise 
¢(mll) = {10 ifl~n 
otherwise 
fl if l>n ~(m 
otherwise 
Finally, we set k to zero. This completes the reduction. To solve an instance of 
Minimum Set Cover in polynomial time, we transform it as above, then call M1- 
DECIDE with inputs b, c, s, k, and the words fl ..... fm in any order. 
613 
Computational Linguistics Volume 25, Number 4 
obtained 
m~ 
however 
) 
ted 
J ..... ~d left left the meal 
Figure 4 
Selecting a concise set of source words is like solving the Minimum Set Cover Problem. A 
channel model with overlapping, one-to-many dictionary entries will typically license many 
decodings. The source model may prefer short decodings over long ones. Searching for a 
decoding of length _< n is difficult, resembling the problem of covering a finite set with a small 
collection of subsets. In the example shown above, the smallest acceptable set of source words 
is {and, cooked, however, left, comma, period}. 
If M1-DECIDE returns yes, then some decoding e with P(e) • P(f\]e) > 0 must exist. 
We know that e must contain n or fewer words--otherwise P(f\[e) = 0 by the c table. 
Furthermore, the s table tells us that every word fj is covered by at least one English 
word in e. Through the one-to-one correspondence between elements of e and C, we 
produce a set cover of size G n for S. 
Likewise, if M1-DECIDE returns no, then all decodings have P(e) • P(f\[e) = 0. 
Because there are no zeroes in the source table b, every e has P(f\[e) = 0. Therefore 
either (1) the length of e exceeds n, or (2) somefj is left tmcovered by the words in e. 
Because source words cover target words in exactly the same fashion as elements of C 
cover S, we conclude that there is no set cover of size < n for S. Figure 4 illustrates the 
intuitive correspondence between source word selection and minimum set covering. 
5. Discussion 
The two proofs point up separate factors in MT decoding complexity. One is word- 
order selection. But even if any word order will do, there is still the problem of picking 
a concise decoding in the face of overlapping bilingual dictionary entries. The former 
is more closely tied to the source model, and the latter to the channel model, though 
the complexity arises from the interaction of the two. 
We should note that Model 1 is an intentionally simple translation model, one 
whose primary purpose in machine translation has been to allow bootstrapping into 
more complex translation models (e.g., IBM Models 2-5). It is easy to show that the 
intractability results also apply to stronger "fertility/distortion" models; we assign 
zero probability to fertilities other than 1, and we set up uniform distortion tables. 
Simple translation models like Model 1 find more direct use in other applications 
(e.g., lexicon construction, idiom detection, psychological norms, and cross-language 
information retrieval), so their computational properties are of wider interest. 
614 
Knight Decoding Complexity 
The proofs we presented are based on a worst-case analysis. Real s, e, and b ta- 
bles may have properties that permit faster optimal decoding than the artificial tables 
constructed above. It is also possible to devise approximation algorithms like those de- 
vised for other NP-complete problems. To the extent that word ordering is like solving 
the Traveling Salesman Problem, it is encouraging substantial progress continues to be 
made on Traveling Salesman algorithms. For example, it is often possible to get within 
two percent of the optimal tour in practice, and some researchers have demonstrated 
an optimal tour of over 13,000 U.S. cities. (The latter experiment relied on things like 
distance symmetry and the triangle inequality constraint, however, which do not hold 
in word ordering.) So far, statistical translation research has either opted for heuristic 
beam-search algorithms or different channel models. For example, some researchers 
avoid bag generation by preprocessing bilingual texts to remove word-order differ- 
ences, while others adopt channels that eliminate syntactically unlikely alignments. 
Finally, expensive decoding also suggests expensive training from unannotated 
(monolingual) texts, which presents a challenging bottleneck for extending statistical 
machine translation to language pairs and domains where large bilingual corpora do 
not exist. 
References 
Brew, Chris. 1992. Letting the cat out of the 
bag: Generation for shake-and-bake MT. 
In Proceedings of the 14th International 
Conference on Computational Linguistics 
(COLING), pages 610-616, Nantes, France, 
August. 
Brown, Peter, Stephen Della-Pietra, Vincent 
Della-Pietra, and Robert Mercer. 1993. The 
mathematics of statistical machine 
translation: Parameter estimation. 
Computational Linguistics, 19(2):263-311. 
Church, Kenneth. 1988. A stochastic parts 
program and noun phrase parser for 
unrestricted text. In Proceedings of the 2nd 
Conference on Applied Natural Language 
Processing, pages 136-143, Austin, TX, 
June. 
615 

