A Polynomial-Time Algorithm for Statistical Machine Translation 
Dekai Wu 
HKUST 
Department of Computer Science 
University of Science and Technology 
Clear Water Bay, Hong Kong 
dekai@cs.ust.hk
Abstract 
We introduce a polynomial-time algorithm 
for statistical machine translation. This 
algorithm can be used in place of the 
expensive, slow best-first search strate- 
gies in current statistical translation ar- 
chitectures. The approach employs the 
stochastic bracketing transduction gram- 
mar (SBTG) model we recently introduced 
to replace earlier word alignment channel 
models, while retaining a bigram language 
model. The new algorithm in our experi- 
ence yields major speed improvement with 
no significant loss of accuracy. 
1 Motivation 
The statistical translation model introduced by IBM 
(Brown et al., 1990) views translation as a noisy 
channel process. Assume, as we do throughout this 
paper, that the input language is Chinese and the 
task is to translate into English. The underlying 
generative model, shown in Figure 1, contains a 
stochastic English sentence generator whose output 
is "corrupted" by the translation channel to produce 
Chinese sentences. In the IBM system, the language 
model employs simple n-grams, while the transla- 
tion model employs several sets of parameters as 
discussed below. Estimation of the parameters has 
been described elsewhere (Brown et al., 1993). 
Translation is performed in the reverse direction 
from generation, as usual for recognition under gen- 
erative models. For each Chinese sentence c that is 
to be translated, the system must attempt to find 
the English sentence e* such that: 
(1) $e^* = \arg\max_{e} \Pr(e \mid c)$
(2) $\phantom{e^*} = \arg\max_{e} \Pr(c \mid e)\,\Pr(e)$
In the IBM model, the search for the optimal e* is 
performed using a best-first heuristic "stack search" 
similar to A* methods. 
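As a minimal sketch of the decision rule in Equations (1) and (2), assume a small, explicitly enumerated candidate set (the real hypothesis space is far larger) and two scoring functions, `channel_logprob` and `lm_logprob`. These names and the toy probabilities are hypothetical illustrations and are not part of the IBM system.

```python
import math

def noisy_channel_decode(c, candidates, channel_logprob, lm_logprob):
    """Return the candidate e maximizing log Pr(c|e) + log Pr(e)."""
    best_e, best_score = None, -math.inf
    for e in candidates:
        score = channel_logprob(c, e) + lm_logprob(e)
        if score > best_score:
            best_e, best_score = e, score
    return best_e, best_score

# Toy tables with made-up probabilities (hypothetical numbers).
_CHANNEL = {("c1", "I took a book"): 0.20, ("c1", "book a took I"): 0.20}
_LM = {"I took a book": 0.010, "book a took I": 0.0001}

def toy_channel(c, e):
    return math.log(_CHANNEL[(c, e)])

def toy_lm(e):
    return math.log(_LM[e])

best, score = noisy_channel_decode("c1", list(_LM), toy_channel, toy_lm)
print(best, score)
```

Both candidates receive the same channel score here, so the language model alone decides between them, which is the role it plays in ordering decisions.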
One of the primary obstacles to making the statistical
translation approach practical is the slow speed of
translation, as performed in A* fashion. This price is
paid for the robustness that is obtained by using very 
flexible language and translation models. The lan- 
guage model allows sentences of arbitrary order and 
the translation model allows arbitrary word-order 
permutation. The models employ no structural con- 
straints, relying instead on probability parameters 
to assign low probabilities to implausible sentences. 
This exhaustive space, together with a massive number
of parameters, permits greater modeling accuracy.
But while accuracy is enhanced, translation efficiency
suffers due to the lack of structure in the hypothesis
space. The translation channel is characterized by two
sets of parameters: translation and alignment
probabilities.¹ The translation probabilities describe
lexical substitution, while alignment probabilities
describe word-order permutation. The key problem is that
the formulation of alignment probabilities
$a(i \mid j, V, T)$ permits the Chinese word in position
$j$ of a length-$T$ sentence to map to any position $i$
of a length-$V$ English sentence. So $V^T$ alignments are
possible, yielding an exponential space with
correspondingly slow search times.
Note there are no explicit linguistic grammars in 
the IBM channel model. Useful methods do exist 
for incorporating constraints fed in from other pre- 
processing modules, and some of these modules do 
employ linguistic grammars. For instance, we previ- 
ously reported a method for improving search times 
in channel translation models that exploits bracket- 
ing information (Wu and Ng, 1995). If any brackets 
for the Chinese sentence can be supplied as addi- 
tional input information, produced for example by 
a preprocessing stage, a modified version of the A*- 
based algorithm can follow the brackets to guide the 
search heuristically. This strategy appears to produce
moderate improvements in search speed and
slightly better translations.
Such linguistic-preprocessing techniques could also be
used with the new model described below, but the issue is
independent of our focus here. In this paper we address
the underlying assumptions of the core channel model
itself, which does not directly use linguistic structure.

¹ Various models have been constructed by the IBM team
(Brown et al., 1993). This description corresponds to one
of the simplest ones, "Model 2"; search costs for the more
complex models are correspondingly higher.

Figure 1: Channel translation model. A stochastic English
generator emits English strings, which pass through the
noisy channel to yield Chinese strings. The generative
model runs in that direction; translation runs in the
reverse direction.
A slightly different model is employed for a 
word alignment application by Dagan et al. (Da- 
gan, Church, and Gale, 1993). Instead of alignment 
probabilities, offset probabilities o(k) are employed, 
where k is essentially the positional distance between 
the English words aligned to two adjacent Chinese 
words: 
(3) $k = i - \left(A(j_{prev}) + (j - j_{prev})\,N\right)$
where $j_{prev}$ is the position of the immediately
preceding Chinese word and $N$ is a constant that
normalizes for average sentence lengths in different
languages. The motivation is that words that are close
to each other in the Chinese sentence should tend
to be close in the English sentence as well. The
size of the parameter set is greatly reduced from
the $|i| \times |j| \times |T| \times |V|$ parameters of
the alignment probabilities, down to a small set of $|k|$
parameters. However, the search space remains the same.
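A small worked instance of Equation (3) may help. The sketch below assumes that A(j_prev) is the English position aligned to the preceding Chinese word, which is our reading of the text rather than a definition given here; the function name, the `aligned_pos` mapping, and the numbers are hypothetical.

```python
def offset(i, j, j_prev, aligned_pos, n_ratio):
    """Equation (3): k = i - (A(j_prev) + (j - j_prev) * N)."""
    expected = aligned_pos[j_prev] + (j - j_prev) * n_ratio
    return i - expected

# Hypothetical example: Chinese word 3 was aligned to English position 4,
# and English sentences are assumed to be 1.25 times as long on average.
aligned_pos = {3: 4}
k = offset(i=6, j=4, j_prev=3, aligned_pos=aligned_pos, n_ratio=1.25)
print(k)
```

A value of k near zero means the English word sits close to where the neighbouring alignment predicts, which is the situation the offset probabilities o(k) are meant to favour.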
The A*-style stack-decoding approach is in some 
ways a carryover from the speech recognition archi- 
tectures that inspired the channel translation model. 
It has proven highly effective for speech recognition 
in both accuracy and speed, where the search space 
contains no order variation since the acoustic and 
text streams can be assumed to be linearly aligned. 
But in contrast, for translation models the stack 
search alone does not adequately compensate for 
the combinatorially more complex space that results 
from permitting arbitrary order variations. Indeed, 
the stack-decoding approach remains impractically 
slow for translation, and has not achieved the same 
kind of speed as for speech recognition. 
The model we describe in this paper, like Dagan 
et al.'s model, encourages related words to stay to- 
gether, and reduces the number of parameters used 
to describe word-order variation. But more importantly,
it makes structural assumptions that eliminate
large portions of the space of alignments, based
on linguistic motivations. This greatly reduces the
search space and makes possible a polynomial-time 
optimization algorithm. 
2 ITG and BTG Overview 
The new translation model is based on the recently 
introduced bilingual language modeling approach. 
Specifically, the model employs a bracketing trans- 
duction grammar or BTG (Wu, 1995a), which is 
a special case of inversion transduction grammars 
or ITGs (Wu, 1995c; Wu, 1995b; Wu, 1995d).
These formalisms were originally developed
for the purpose of parallel corpus annotation, with 
applications for bracketing, alignment, and segmen- 
tation. This paper finds they are also useful for the 
translation system itself. In this section we summa- 
rize the main properties of BTGs and ITGs. 
An ITG consists of context-free productions where
terminal symbols come in couples, for example x/y,
where x is a Chinese word and y is an English
translation of x.² Any parse tree thus generates two
strings, one on the Chinese stream and one on the
English stream. Thus, the tree:
(1) [我/I [[拿了/took [一/a 本/ε 書/book]NP ]VP [給/for 你/you]PP ]VP ]S
produces, for example, the mutual translations:
(2) a. [我 [[拿了 [一本書]NP ]VP [給你]PP ]VP ]S
       [Wǒ [[ná le [yī běn shū]NP ]VP [gěi nǐ]PP ]VP ]S
    b. [I [[took [a book]NP ]VP [for you]PP ]VP ]S
An additional mechanism accommodates a con- 
servative degree of word-order variation between the 
two languages. With each production of the gram- 
mar is associated either a straight orientation or an 
inverted orientation, respectively denoted as follows: 
VP → [VP PP]
VP → ⟨VP PP⟩
In the case of a production with straight orientation,
the right-hand-side symbols are visited left-to-right
for both the Chinese and English streams. But for a
production with inverted orientation, the
right-hand-side symbols are visited left-to-right for
Chinese and right-to-left for English. Thus, the tree:
(3) [我/I ⟨[給/for 你/you]PP [拿了/took [一/a 本/ε 書/book]NP ]VP ⟩VP ]S
produces translations with different word order:
(4) a. [我 [[給你]PP [拿了 [一本書]NP ]VP ]VP ]S
    b. [I [[took [a book]NP ]VP [for you]PP ]VP ]S

² Readers of the papers cited above should note that
we have switched the roles of English and Chinese here,
which helps simplify the presentation of the new
translation algorithm.

 f   BTG          all matchings     ratio
 0   1            1                 1.000
 1   1            1                 1.000
 2   2            2                 1.000
 3   6            6                 1.000
 4   22           24                0.917
 5   90           120               0.750
 6   394          720               0.547
 7   1806         5040              0.358
 8   8558         40320             0.212
 9   41586        362880            0.115
10   206098       3628800           0.057
11   1037718      39916800          0.026
12   5293446      479001600         0.011
13   27297738     6227020800        0.004
14   142078746    87178291200       0.002
15   745387038    1307674368000     0.001
16   3937603038   20922789888000    0.000
Figure 2: Number of legal word alignments between
sentences of length f, with and without the BTG
restriction.
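The effect of straight and inverted orientation on the two output streams can be sketched in a few lines. The tuple encoding of trees, the function name `yields`, and the use of toneless pinyin tokens in place of Chinese characters are illustrative assumptions, not the representation used in our system.

```python
def yields(node):
    """Return (chinese_tokens, english_tokens) generated by an ITG parse tree.

    A leaf is a couple (x, y); None stands for the empty word.
    An internal node is (orientation, [children]) with orientation
    "[]" (straight) or "<>" (inverted).
    """
    if isinstance(node[1], list):
        orientation, children = node
        parts = [yields(child) for child in children]
        # The Chinese stream always visits the children left-to-right.
        chinese = [w for zh, _ in parts for w in zh]
        english_parts = [en for _, en in parts]
        if orientation == "<>":
            english_parts.reverse()  # English stream is read right-to-left
        english = [w for en in english_parts for w in en]
        return chinese, english
    x, y = node
    return ([x] if x else []), ([y] if y else [])

np_ = ("[]", [("yi", "a"), ("ben", None), ("shu", "book")])
vp = ("[]", [("na le", "took"), np_])
pp = ("[]", [("gei", "for"), ("ni", "you")])
tree1 = ("[]", [("wo", "I"), ("[]", [vp, pp])])   # all straight
tree3 = ("[]", [("wo", "I"), ("<>", [pp, vp])])   # inverted VP

print(yields(tree1))
print(yields(tree3))
```

Both trees yield the same English string, while the Chinese stream of the second tree places the prepositional phrase before the verb phrase.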
In the special case of BTGs which are employed 
in the model presented below, there is only one un- 
differentiated nonterminal category (aside from the 
start symbol). Designating this category A, this 
means all non-lexical productions are of one of these 
two forms: 
A → [A A ... A]
A → ⟨A A ... A⟩
The degree of word-order flexibility is the criti- 
cal point. BTGs make a favorable trade-off between 
efficiency and expressiveness: constraints are strong 
enough to allow algorithms to operate efficiently, but 
without so much loss of expressiveness as to hinder 
useful translation. We summarize here; details are 
given elsewhere (Wu, 1995b). 
With regard to efficiency, Figure 2 demonstrates 
the kind of reduction that BTGs obtain in the space 
of possible alignments. The number of possible 
alignments, compared against the unrestricted case 
where any English word may align to any Chinese 
position, drops off dramatically for strings longer 
than four words. (This table makes the simplifica- 
tion of counting only 1-1 matchings and is merely 
representative.) 
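The BTG column for short sentences can be checked by brute force. The sketch below treats a 1-1 matching as a permutation and tests whether it can be built recursively from straight and inverted binary splits; the function names are hypothetical, and exhaustive enumeration is only practical for small f.

```python
from functools import lru_cache
from itertools import permutations
from math import factorial

@lru_cache(maxsize=None)
def btg_generable(perm):
    """True if the matching `perm` can be built from straight/inverted splits."""
    if len(perm) <= 1:
        return True
    for k in range(1, len(perm)):
        left, right = perm[:k], perm[k:]
        straight = max(left) < min(right)   # left block maps below right block
        inverted = min(left) > max(right)   # left block maps above right block
        if (straight or inverted) and btg_generable(left) and btg_generable(right):
            return True
    return False

def count_btg(f):
    """Number of BTG-generable 1-1 matchings between sentences of length f."""
    return sum(btg_generable(p) for p in permutations(range(f)))

for f in range(1, 8):
    n = count_btg(f)
    print(f, n, factorial(f), f"{n / factorial(f):.3f}")
```

For f = 4 the only two permutations rejected are the "inside-out" arrangements, which is where the ratio first falls below 1.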
With regard to expressiveness, we believe that almost
all variation in the order of arguments in a
syntactic frame can be accommodated.³ Syntactic
frames generally contain four or fewer subconstituents.
Figure 2 shows that for the case of four
subconstituents, BTGs permit 22 out of the 24 pos- 
sible alignments. The only prohibited arrangements 
are "inside-out" transformations (Wu, 1995b), which 
we have been unable to find any examples of in our 
corpus. Moreover, extremely distorted alignments 
can be handled by BTGs (Wu, 1995c), without re- 
sorting to the unrestricted-alignment model. 
The translation expressiveness of BTGs is by no 
means perfect. They are nonetheless proving very 
useful in applications and are substantially more fea- 
sible than previous models. In our previous corpus 
analysis applications, any expressiveness limitations 
were easily tolerable since degradation was graceful. 
In the present translation application, any expres- 
siveness limitation simply means that certain trans- 
lations are not considered. 
For the remainder of the paper, we take advantage
of a convenient normal-form theorem (Wu, 1995a)
that allows us to assume without loss of generality
that the BTG only contains the binary-branching
form for the non-lexical productions.⁴
3 BTG-Based Search for the 
Original Models 
A first approach to improving the translation search 
is to limit the allowed word alignment patterns to 
those permitted by a BTG. In this case, Equation (2) 
is kept as the objective function and the translation 
channel can be parameterized similarly to Dagan et 
al. (Dagan, Church, and Gale, 1993). The effect of 
the BTG restriction is just to constrain the shapes of 
the word-order distortions. A BTG rather than ITG 
is used since, as we discussed earlier, pure channel 
translation models operate without explicit gram- 
mars, providing no constituent categories around 
which a more sophisticated ITG could be structured. 
But the structural constraints of the BTG can im- 
prove search efficiency, even without differentiated 
constituent categories. Just as in the baseline sys- 
tem, we rely on the language and translation models 
to take up the slack in place of an explicit grammar. 
In this approach, an $O(T^7)$ algorithm similar to the
one described later can be constructed to replace A*
search.
³ Note that these points are not directed at free
word-order languages. But in such languages, explicit
morphological inflections make role identification and
translation easier.
⁴ But see the conclusion for a caveat.
However, we do not feel it is worth preserving offset
(or alignment or distortion) parameters simply
for the sake of preserving the original translation 
channel model. These parameterizations were only 
intended to crudely model word-order variation. In- 
stead, the BTG itself can be used directly to proba- 
bilistically rank alternative alignments, as described 
next. 
4 Replacing the Channel Model 
with a SBTG 
The second possibility is to use a stochastic brack- 
eting transduction grammar (SBTG) in the channel 
model, replacing the translation model altogether. 
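In a SBTG, a probability is associated with each
production. Thus for the normal-form BTG, we have:

$$\begin{aligned}
A &\xrightarrow{a_{[\,]}} [A\ A] \\
A &\xrightarrow{a_{\langle\rangle}} \langle A\ A \rangle \\
A &\xrightarrow{b(x,y)} x/y && \text{for all } x, y \text{ lexical translations} \\
A &\xrightarrow{b(x,\epsilon)} x/\epsilon && \text{for all } x \text{ Chinese vocabulary} \\
A &\xrightarrow{b(\epsilon,y)} \epsilon/y && \text{for all } y \text{ English vocabulary}
\end{aligned}$$

The translation lexicon is encoded in productions of
the third kind. The latter two kinds of productions
allow words of either Chinese or English to go
unmatched.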
The SBTG assigns a probability $\Pr(c, e, q)$ to all
generable trees $q$ and sentence-pairs. In principle
it can be used as the translation channel model by
normalizing with $\Pr(e)$ and integrating out $\Pr(q)$ to
give $\Pr(c \mid e)$ in Equation (2). In practice, a strong
language model makes this unnecessary, so we can
instead optimize the simpler Viterbi approximation
(4) $e^* = \arg\max_{e} \Pr(c, e, q)\,\Pr(e)$
To complete the picture we add a bigram model
$g_{e_{j-1} e_j} = g(e_j \mid e_{j-1})$ for the English
language model $\Pr(e)$.
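To make the objective in Equation (4) concrete, the following sketch scores one fixed parse tree: the SBTG term is the product of the probabilities of the productions used, and the language model term is the bigram probability of the English yield. The tuple encoding, the function names, and all numbers are hypothetical; the toy probabilities are not normalized.

```python
import math

def tree_logprob(node, a_straight, a_inverted, b):
    """log Pr(c, e, q): sum of log-probabilities of the productions in tree q.

    Leaves are couples (x, y); internal nodes are (orientation, left, right)
    with orientation "[]" or "<>" (normal-form, binary-branching BTG).
    """
    if len(node) == 2:
        return math.log(b[node])
    orientation, left, right = node
    a = a_straight if orientation == "[]" else a_inverted
    return (math.log(a)
            + tree_logprob(left, a_straight, a_inverted, b)
            + tree_logprob(right, a_straight, a_inverted, b))

def english_yield(node):
    """English words generated by the tree, respecting inversions."""
    if len(node) == 2:
        return [node[1]] if node[1] else []
    orientation, left, right = node
    l, r = english_yield(left), english_yield(right)
    return l + r if orientation == "[]" else r + l

def bigram_logprob(words, g):
    """log Pr(e) under a bigram model g[(prev, cur)], padded with <s> and </s>."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(g[(p, w)]) for p, w in zip(padded, padded[1:]))

def objective(tree, a_straight, a_inverted, b, g):
    """log of Pr(c, e, q) * Pr(e) for one fixed tree q."""
    return (tree_logprob(tree, a_straight, a_inverted, b)
            + bigram_logprob(english_yield(tree), g))

b = {("gei", "for"): 0.5, ("ni", "you"): 0.5}
g = {("<s>", "for"): 0.1, ("for", "you"): 0.2, ("you", "</s>"): 0.5}
tree = ("[]", ("gei", "for"), ("ni", "you"))
score = objective(tree, 0.6, 0.4, b, g)
print(score)
```

The search problem is then to find the tree and English string that maximize this quantity for a given Chinese input.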
Offset, alignment, or distortion parameters are 
entirely eliminated. A large part of the im- 
plicit function of such parameters--to prevent align- 
ments where too many frame arguments become 
separated--is rendered unnecessary by the BTG's 
structural constraints, which prohibit many such 
configurations altogether. Another part of the
parameters' purpose is subsumed by the SBTG's
probabilities $a_{[\,]}$ and $a_{\langle\rangle}$, which can be set to prefer
straight or inverted orientation depending on the 
language pair. As in the original models, the lan- 
guage model heavily influences the remaining order- 
ing decisions. 
Matters are complicated by the presence of the bi- 
gram model in the objective function (which word- 
alignment models, as opposed to translation models, 
do not need to deal with). As in our word-alignment 
model, the translation algorithm optimizes Equa- 
tion (4) via dynamic programming, similar to chart 
parsing (Earley, 1970) but with a probabilistic ob- 
jective function as for HMMs (Viterbi, 1967). But 
unlike the word-alignment model, to accommodate 
the bigram model we introduce indexes in the recur- 
rence not only on subtrees over the source Chinese 
string, but also on the delimiting words of the target 
English substrings. 
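The idea of indexing chart entries by the delimiting English words can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: each Chinese token is treated as one word (no joint segmentation), the empty-word productions are omitted, the sentence-boundary bigrams are applied once at the end rather than through sentinel tokens, and unseen bigrams receive a hypothetical `floor` probability. All names and numbers are illustrative.

```python
def btg_decode(chinese, lexicon, bigram, a_straight=0.6, a_inverted=0.4, floor=1e-6):
    """Viterbi search over binary BTG trees with a bigram language model.

    chinese: non-empty list of tokens; lexicon[x] = {y: b(x, y)};
    bigram[(u, v)] = g(v | u). Returns (best probability, English words).
    """
    n = len(chinese)

    def g(u, v):
        return bigram.get((u, v), floor)

    # delta[(s, t)][(y, z)]: best probability of a subtree over chinese[s:t]
    # whose English yield starts with y and ends with z.
    delta = {(s, t): {} for s in range(n) for t in range(s + 1, n + 1)}
    back = {}
    for s, x in enumerate(chinese):
        for y, p in lexicon[x].items():
            delta[(s, s + 1)][(y, y)] = p
            back[(s, s + 1, y, y)] = None          # lexical production
    for length in range(2, n + 1):
        for s in range(n - length + 1):
            t = s + length
            cell = delta[(s, t)]
            for S in range(s + 1, t):              # split point
                for (ly, lz), pl in delta[(s, S)].items():
                    for (ry, rz), pr in delta[(S, t)].items():
                        options = (
                            ("[]", (ly, rz), a_straight * pl * pr * g(lz, ry)),
                            ("<>", (ry, lz), a_inverted * pl * pr * g(rz, ly)),
                        )
                        for orient, key, p in options:
                            if p > cell.get(key, 0.0):
                                cell[key] = p
                                back[(s, t) + key] = (orient, S, (ly, lz), (ry, rz))

    def english(s, t, key):
        bp = back[(s, t) + key]
        if bp is None:
            return [key[0]]
        orient, S, lkey, rkey = bp
        left, right = english(s, S, lkey), english(S, t, rkey)
        return left + right if orient == "[]" else right + left

    best_key, best_p = None, 0.0
    for (y, z), p in delta[(0, n)].items():
        total = p * g("<s>", y) * g(z, "</s>")     # sentence-boundary bigrams
        if total > best_p:
            best_key, best_p = (y, z), total
    return best_p, english(0, n, best_key)

lexicon = {"gei": {"for": 1.0}, "ni": {"you": 1.0}, "na le": {"took": 1.0}}
bigram = {("<s>", "took"): 0.5, ("took", "for"): 0.5,
          ("for", "you"): 0.5, ("you", "</s>"): 0.5}
prob, words = btg_decode(["gei", "ni", "na le"], lexicon, bigram)
print(prob, words)
```

Because a bigram depends only on adjacent words, the first and last English words of each constituent are all that a parent needs to know when joining two children, which is why these two words suffice as extra indexes.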
Another feature of the algorithm is that segmen- 
tation of the Chinese input sentence is performed 
in parallel with the translation search. Conventional
architectures for Chinese NLP generally attempt
to identify word boundaries as a preprocessing
stage.⁵ Whenever the segmentation preprocessor
prematurely commits to an inappropriate segmenta- 
tion, difficulties are created for later stages. This 
problem is particularly acute for translation, since 
the decision as to whether to regard a sequence as a 
single unit depends on whether its components can 
be translated compositionally. This in turn often 
depends on what the target language is. In other 
words, the Chinese cannot be appropriately seg- 
mented except with respect to the target language of 
translation--a task-driven definition of correct seg- 
mentation. 
The algorithm is given below. A few remarks
about the notation used: $c_{s..t}$ denotes the
subsequence of Chinese tokens $c_{s+1}, c_{s+2}, \ldots, c_t$.
We use $E(s..t)$ to denote the set of English words that
are translations of the Chinese word created by taking
all tokens in $c_{s..t}$ together. $E(s,t)$ denotes the set
of English words that are translations of any of the
Chinese words anywhere within $c_{s..t}$. Note also that
we assume the explicit sentence-start and sentence-end
tokens $c_0$ = <s> and $c_{T+1}$ = </s>, which makes
the algorithm description more parsimonious. Finally,
the argmax operator is generalized to vector
notation to accommodate multiple indices.
1. Initialization

$$\delta^{0}_{styy} = b(c_{s..t}/y), \qquad 0 \le s < t \le T,\quad y \in E(s..t)$$

2. Recursion. For all $s, t, y, z$ such that
$-1 \le s < t \le T+1$, $y \in E(s,t)$, $z \in E(s,t)$:

$$\delta_{styz} = \max\left[\delta^{[\,]}_{styz},\ \delta^{\langle\rangle}_{styz},\ \delta^{0}_{styz}\right]$$

$$\theta_{styz} = \begin{cases}
[\,] & \text{if } \delta^{[\,]}_{styz} \ge \delta^{\langle\rangle}_{styz} \text{ and } \delta^{[\,]}_{styz} > \delta^{0}_{styz} \\
\langle\rangle & \text{if } \delta^{\langle\rangle}_{styz} \ge \delta^{[\,]}_{styz} \text{ and } \delta^{\langle\rangle}_{styz} > \delta^{0}_{styz} \\
0 & \text{otherwise}
\end{cases}$$

where

$$\delta^{[\,]}_{styz} = \max_{\substack{s<S<t \\ Y \in E(s,S) \\ Z \in E(S,t)}} a_{[\,]}\ \delta_{sSyY}\ \delta_{StZz}\ g_{YZ}$$

$$\begin{bmatrix} \sigma^{[\,]}_{styz} \\ \phi^{[\,]}_{styz} \\ \omega^{[\,]}_{styz} \end{bmatrix} = \operatorname*{argmax}_{\substack{s<S<t \\ Y \in E(s,S) \\ Z \in E(S,t)}} a_{[\,]}\ \delta_{sSyY}\ \delta_{StZz}\ g_{YZ}$$

$$\delta^{\langle\rangle}_{styz} = \max_{\substack{s<S<t \\ Y \in E(S,t) \\ Z \in E(s,S)}} a_{\langle\rangle}\ \delta_{sSZz}\ \delta_{StyY}\ g_{YZ}$$

$$\begin{bmatrix} \sigma^{\langle\rangle}_{styz} \\ \phi^{\langle\rangle}_{styz} \\ \omega^{\langle\rangle}_{styz} \end{bmatrix} = \operatorname*{argmax}_{\substack{s<S<t \\ Y \in E(S,t) \\ Z \in E(s,S)}} a_{\langle\rangle}\ \delta_{sSZz}\ \delta_{StyY}\ g_{YZ}$$

3. Reconstruction. Initialize by setting the root
of the parse tree to $q_0 = (-1, T+1,$ <s>, </s>$)$. The
remaining descendants in the optimal parse tree are
then given recursively for any $q = (s, t, y, z)$ by:

$$\mathrm{LEFT}(q) = \begin{cases}
\mathrm{NIL} & \text{if } t - s \le 1 \\
(s,\ \sigma^{[\,]}_q,\ y,\ \phi^{[\,]}_q) & \text{if } \theta_q = [\,] \\
(s,\ \sigma^{\langle\rangle}_q,\ \omega^{\langle\rangle}_q,\ z) & \text{if } \theta_q = \langle\rangle \\
\mathrm{NIL} & \text{otherwise}
\end{cases}$$

$$\mathrm{RIGHT}(q) = \begin{cases}
\mathrm{NIL} & \text{if } t - s \le 1 \\
(\sigma^{[\,]}_q,\ t,\ \omega^{[\,]}_q,\ z) & \text{if } \theta_q = [\,] \\
(\sigma^{\langle\rangle}_q,\ t,\ y,\ \phi^{\langle\rangle}_q) & \text{if } \theta_q = \langle\rangle \\
\mathrm{NIL} & \text{otherwise}
\end{cases}$$

Assume the number of translations per word is
bounded by some constant. Then the maximum size
of $E(s,t)$ is proportional to $t - s$. The recursion
ranges over $O(T^3)$ choices of the positions $s$, $S$, $t$
and $O(T^4)$ choices of the delimiting words $y$, $z$, $Y$,
$Z$, so the asymptotic time complexity for the
translation algorithm is bounded by $O(T^7)$. Note that
in practice, actual performance is improved by the
sparseness of the translation matrix.

An interesting connection has been suggested to
direct parsing for ID/LP grammars (Shieber, 1984),
in which word-order variations would be accommodated
by the parser, and related ideas for generation
of free word-order languages in the TAG framework
(Joshi, 1987). Our work differs from the ID/LP
work in several important respects. First, we are not
merely parsing, but translating with a bigram language
model. Also, of course, we are dealing with
a probabilistic optimization problem. But perhaps
most importantly, our goal is to constrain as tightly
as possible the space of possible transduction
relationships between two languages with fixed
word-order, making no other language-specific
assumptions; we are thus driven to seek a kind of
language-universal property. In contrast, the ID/LP work
was directed at parsing a single language with free
word-order. As a consequence, it would be necessary
to enumerate a specific set of linear-precedence
(LP) relations for the language, and moreover the
immediate-dominance (ID) productions would typically
be more complex than binary-branching. This
significantly increases time complexity, compared to
our BTG model. Although it is not mentioned in
their paper, the time complexity for ID/LP parsing
rises exponentially with the length of production
right-hand-sides, due to the number of permutations.
ITGs avoid this with their restriction to inversions,
rather than permutations, and BTGs further
minimize the grammar size. We have also confirmed
empirically that our models would not be feasible
under general permutations.

⁵ Written Chinese contains no spaces to delimit words;
any spaces in the earlier examples are artifacts of the
parse tree brackets.

Category     Original A*   Bracket A*   BTG-Channel
Correct      67.5          69.8         68.2
Incorrect    32.5          30.2         31.8
Figure 3: Translation accuracy (percentage correct).
5 Results 
The algorithm above was tested in the SILC transla- 
tion system. The translation lexicon was largely con- 
structed by training on the HKUST English-Chinese 
Parallel Bilingual Corpus, which consists of govern- 
mental transcripts. The corpus was sentence-aligned 
statistically (Wu, 1994); Chinese words and colloca- 
tions were extracted (Fung and Wu, 1994; Wu and 
Fung, 1994); then translation pairs were learned via 
an EM procedure (Wu and Xia, 1995). The re- 
sulting English vocabulary is approximately 6,500 
words and the Chinese vocabulary is approximately 
5,500 words, with a many-to-many translation map- 
ping averaging 2.25 Chinese translations per English 
word. Due to the unsupervised training, the translation
lexicon contains noise and is only at about 86%
weighted precision.
With regard to accuracy, we merely wish to
demonstrate that for statistical MT, accuracy is not
significantly compromised by substituting our efficient
optimization algorithm. It is not our purpose
here to argue that accuracy can be increased with
our model. No morphological processing has been
used to correct the output, and until now we have
only been testing with a bigram model trained on
extremely limited samples. A coarse evaluation of
translation accuracy was performed on a random
sample drawn from Chinese sentences of fewer than
20 words from the parallel corpus, the results of
which are shown in Figure 3. We have judged only
whether the correct meaning (as determined by the
corresponding English sentence in the parallel corpus)
is conveyed by the translation, paying particular
attention to word order, but otherwise ignoring
morphological and function word choices. For comparison,
the accuracies from the A*-based systems
are also shown. There is no significant difference
in the accuracy. Some examples of the output are
shown in Figure 4.

Input:  (Xiāng gǎng de ān dìng fán róng shì wǒ mén shēng huó fāng shì de zhī zhù.)
Output: Hong Kong's stabilize boom is us life styles's pillar.
Corpus: Our prosperity and stability underpin our way of life.

Input:  (Běn gǎng de jīng jì qián jǐng yǔ zhōng guó, tè bié shì guǎng dōng shěng de jīng jì qián jǐng xī xī xiāng guān.)
Output: Hong Kong's economic foreground with China, particular Guangdong province's economic foreground vitally interrelated.
Corpus: Our economic future is inextricably bound up with China, and with Guangdong Province in particular.

Input:  (Wǒ wán quán zhī chí tā de yì jiàn.)
Output: I absolutely uphold his views.
Corpus: I fully support his views.

Input:  (Zhè xiē ān pái kě jiā qiáng wǒ mén rì hòu wéi chí jīn róng wěn dìng de néng lì.)
Output: These arrangements can enforce us future kept financial stabilization's competency.
Corpus: These arrangements will enhance our ability to maintain monetary stability in the years to come.

Input:  (Bù guò, wǒ xiàn zài kě yǐ kěn dìng de shuō, wǒ mén jiāng huì tí gōng wèi dá dào gè xiàng zhǔ yào mù biāo suǒ xū de jīng fèi.)
Output: However, I now can certainty's say, will provide for us attain various dominant goal necessary's current expenditure.
Corpus: The consultation process is continuing but I can confirm now that the necessary funds will be made available to meet the key targets.

Figure 4: Example translation outputs.
On the other hand, the new algorithm has indeed 
proven to be much faster. At present we are unable 
to use direct measurement to compare the speed of 
the systems meaningfully, because of vast implemen- 
tational differences between the systems. However, 
the order-of-magnitude improvements are immedi- 
ately apparent. In the earlier system, translation of 
single sentences required on the order of hours (Sun 
Sparc 10 workstations). In contrast the new algo- 
rithm generally takes less than one minute--usually 
substantially less--with no special optimization of 
the code. 
6 Conclusion 
We have introduced a new algorithm for the run- 
time optimization step in statistical machine trans- 
lation systems, whose polynomial-time complexity 
addresses one of the primary obstacles to practicality 
facing statistical MT. The underlying model for the 
algorithm is a combination of the stochastic BTG 
and bigram models. The improvement in speed does 
not appear to impair accuracy significantly. 
We have implemented a version that accepts ITGs 
rather than BTGs, and plan to experiment with 
more heavily structured models. However, it is im- 
portant to note that the search complexity rises ex- 
ponentially rather than polynomially with the size of 
the grammar, just as for context-free parsing (Bar- 
ton, Berwick, and Ristad, 1987). This is not relevant 
to the BTG-based model we have described since its 
grammar size is fixed; in fact the BTG's minimal 
grammar size has been an important advantage over 
more linguistically-motivated ITG-based models. 
We have also implemented a generalized version 
that accepts arbitrary grammars not restricted to 
normal form, with two motivations. The pragmatic 
benefit is that structured grammars become easier 
to write, and more concise. The expressiveness ben- 
efit is that a wider family of probability distribu- 
tions can be written. As stated earlier, the normal 
form theorem guarantees that the same set of shapes 
will be explored by our search algorithm, regardless 
of whether a binary-branching BTG or an arbitrary 
BTG is used. But it may sometimes be useful to 
place probabilities on n-ary productions that vary 
with n in a way that cannot be expressed by com- 
posing binary productions; for example one might 
wish to encourage longer straight productions. The 
generalized version permits such strategies. 
Currently we are evaluating robustness extensions 
of the algorithm that permit words suggested by the 
language model to be inserted in the output sen- 
tence, which the original A* algorithms permitted. 
Acknowledgements 
Thanks to an anonymous referee for valuable com- 
ments, and to the SILC group members: Xuanyin 
Xia, Eva Wai-Man Fong, Cindy Ng, Hong-sing 
Wong, and Daniel Ka-Leung Chan. Many thanks 
also to Kathleen McKeown and her group for
discussion, support, and assistance.

References 
Barton, G. Edward, Robert C. Berwick, and 
Eric Sven Ristad. 1987. Computational Complex- 
ity and Natural Language. MIT Press, Cambridge, 
MA. 
Brown, Peter F., John Cocke, Stephen A. DellaPi- 
etra, Vincent J. DellaPietra, Frederick Jelinek, 
John D. Lafferty, Robert L. Mercer, and Paul S. 
Roossin. 1990. A statistical approach to machine 
translation. Computational Linguistics, 16(2):29- 
85. 
Brown, Peter F., Stephen A. DellaPietra, Vincent J.
DellaPietra, and Robert L. Mercer. 1993. The
mathematics of statistical machine translation:
Parameter estimation. Computational Linguistics,
19(2):263-311.
Dagan, Ido, Kenneth W. Church, and William A. 
Gale. 1993. Robust bilingual word alignment 
for machine aided translation. In Proceedings of 
the Workshop on Very Large Corpora, pages 1-8, 
Columbus, OH, June. 
Earley, Jay. 1970. An efficient context-free pars- 
ing algorithm. Communications of the Associa- 
tion for Computing Machinery, 13(2):94-102. 
Fung, Pascale and Dekai Wu. 1994. Statistical aug- 
mentation of a Chinese machine-readable dictio- 
nary. In Proceedings of the Second Annual Work- 
shop on Very Large Corpora, pages 69-85, Kyoto, 
August. 
Joshi, Aravind K. 1987. Word-order variation in 
natural language generation. In Proceedings of 
AAAI-87, Sixth National Conference on Artificial 
Intelligence, pages 550-555. 
Shieber, Stuart M. 1984. Direct parsing of ID/LP 
grammars. Linguistics and Philosophy, 7:135- 
154. 
Viterbi, Andrew J. 1967. Error bounds for convolutional
codes and an asymptotically optimal decoding
algorithm. IEEE Transactions on Information
Theory, 13:260-269.
Wu, Dekai. 1994. Aligning a parallel English- 
Chinese corpus statistically with lexical criteria. 
In Proceedings of the 32nd Annual Conference 
of the Association for Computational Linguistics, 
pages 80-87, Las Cruces, New Mexico, June. 
Wu, Dekai. 1995a. An algorithm for simultaneously 
bracketing parallel texts by aligning words. In 
Proceedings of the 33rd Annual Conference of the 
Association for Computational Linguistics, pages 
244-251, Cambridge, Massachusetts, June. 
Wu, Dekai. 1995b. Grammarless extraction of 
phrasal translation examples from parallel texts. 
In TMI-95, Proceedings of the Sixth International 
Conference on Theoretical and Methodological Is- 
sues in Machine Translation, volume 2, pages 
354-372, Leuven, Belgium, July. 
Wu, Dekai. 1995c. Stochastic inversion transduction
grammars, with application to segmentation,
bracketing, and alignment of parallel corpora.
In Proceedings of IJCAI-95, Fourteenth International
Joint Conference on Artificial Intelligence,
pages 1328-1334, Montreal, August.
Wu, Dekai. 1995d. Trainable coarse bilingual grammars
for parallel text bracketing. In Proceedings
of the Third Annual Workshop on Very Large
Corpora, pages 69-81, Cambridge, Massachusetts,
June.
Wu, Dekai and Pascale Fung. 1994. Improving
Chinese tokenization with linguistic filters on
statistical lexical acquisition. In Proceedings of the
Fourth Conference on Applied Natural Language
Processing, pages 180-181, Stuttgart, October.
Wu, Dekai and Cindy Ng. 1995. Using brackets
to improve search for statistical machine translation.
In PACLIC-10, Pacific Asia Conference on
Language, Information and Computation, pages
195-204, Hong Kong, December.
Wu, Dekai and Xuanyin Xia. 1995. Large-scale au- 
tomatic extraction of an English-Chinese lexicon. 
Machine Translation, 9(3-4):285-313. 
