Machine Translation with a Stochastic Grammatical Channel 
Dekai Wu and Hongsing WONG 
HKUST 
Human Language Technology Center 
Department of Computer Science 
University of Science and Technology 
Clear Water Bay, Hong Kong {dekai,wong}@cs.ust.hk 
Abstract 
We introduce a stochastic grammatical channel 
model for machine translation, that synthesizes sev- 
eral desirable characteristics of both statistical and 
grammatical machine translation. As with the 
pure statistical translation model described by Wu 
(1996) (in which a bracketing transduction gram- 
mar models the channel), alternative hypotheses 
compete probabilistically, exhaustive search of the 
translation hypothesis space can be performed in 
polynomial time, and robustness heuristics arise 
naturally from a language-independent inversion- 
transduction model. However, unlike pure statisti- 
cal translation models, the generated output string 
is guaranteed to conform to a given target gram- 
mar. The model employs only (1) a translation 
lexicon, (2) a context-free grammar for the target 
language, and (3) a bigram language model. The 
fact that no explicit bilingual translation rules are 
used makes the model easily portable to a variety of 
source languages. Initial experiments show that it 
also achieves significant speed gains over our ear- 
lier model. 
1 Motivation 
Speed of statistical machine translation methods 
has long been an issue. A step forward was taken by
Wu (1996), who introduced a polynomial-time
algorithm for the runtime search for an optimal 
translation. To achieve this, Wu's method substi- 
tuted a language-independent stochastic bracketing 
transduction grammar (SBTG) in place of the sim- 
pler word-alignment channel models reviewed in 
Section 2. The SBTG channel made exhaustive 
search possible through dynamic programming, in- 
stead of previous "stack search" heuristics. Trans- 
lation accuracy was not compromised, because the 
SBTG is apparently flexible enough to model word- 
order variation (between English and Chinese) even 
though it eliminates large portions of the space of 
word alignments. The SBTG can be regarded as 
a model of the language-universal hypothesis that 
closely related arguments tend to stay together (Wu, 
1995a; Wu, 1995b). 
In this paper we introduce a generalization of 
Wu's method with the objectives of 
1. increasing translation speed further, 
2. improving meaning-preservation accuracy, 
3. improving grammaticality of the output, and 
4. seeding a natural transition toward transduc- 
tion rule models, 
under the constraint of 
• employing no additional knowledge resources 
except a grammar for the target language. 
To achieve these objectives, we: 
• replace Wu's SBTG channel with a full 
stochastic inversion transduction grammar or 
SITG channel, discussed in Section 3, and 
• (mis-)use the target language grammar as a 
SITG, discussed in Section 4. 
In Wu's SBTG method, the burden of generating 
grammatical output rests mostly on the bigram lan- 
guage model; explicit grammatical knowledge can- 
not be used. As a result, output grammaticality can- 
not be guaranteed. The advantage is that language- 
dependent syntactic knowledge resources are not 
needed. 
We relax those constraints here by assuming a 
good (monolingual) context-free grammar for the 
target language. Compared to other knowledge 
resources (such as transfer rules or semantic on- 
tologies), monolingual syntactic grammars are rel- 
atively easy to acquire or construct. We use the 
grammar in the SITG channel, while retaining the 
bigram language model. The new model facilitates 
explicit coding of grammatical knowledge and finer 
control over channel probabilities. As with Wu's SBTG
model, the translation hypothesis space can be
exhaustively searched in polynomial time, as shown in
Section 5. The experiments discussed in Section 6 
show promising results for these directions. 
2 Review: Noisy Channel Model 
The statistical translation model introduced by IBM 
(Brown et al., 1990) views translation as a noisy 
channel process. The underlying generative model 
contains a stochastic Chinese (input) sentence gen- 
erator whose output is "corrupted" by the transla- 
tion channel to produce English (output) sentences. 
Assume, as we do throughout this paper, that the 
input language is English and the task is to trans- 
late into Chinese. In the IBM system, the language 
model employs simple n-grams, while the transla- 
tion model employs several sets of parameters as 
discussed below. Estimation of the parameters has 
been described elsewhere (Brown et al., 1993). 
Translation is performed in the reverse direction 
from generation, as usual for recognition under gen- 
erative models. For each English sentence e to be 
translated, the system attempts to find the Chinese 
sentence $c^*$ such that:
$$c^* = \arg\max_c \Pr(c \mid e) = \arg\max_c \Pr(e \mid c)\,\Pr(c) \qquad (1)$$
In the IBM model, the search for the optimal $c^*$ is
performed using a best-first heuristic "stack search" 
similar to A* methods. 
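As a minimal sketch of the decision rule in Equation (1), the snippet below picks the candidate that maximizes the product of channel and language model probabilities, working in log space. The candidate names and all probabilities are hypothetical toy values, and enumerating candidates explicitly is an assumption made only for illustration; it is not how the IBM stack search operates.

```python
import math

def best_translation(candidates, channel_logprob, lm_logprob):
    """Return the candidate c maximizing log Pr(e|c) + log Pr(c)."""
    return max(candidates, key=lambda c: channel_logprob[c] + lm_logprob[c])

# Toy numbers (hypothetical): two candidate Chinese outputs for one English input.
channel_logprob = {"c1": math.log(0.20), "c2": math.log(0.30)}  # log Pr(e|c)
lm_logprob = {"c1": math.log(0.10), "c2": math.log(0.01)}       # log Pr(c)

best = best_translation(["c1", "c2"], channel_logprob, lm_logprob)
print(best)
```

Here the language model outweighs the slightly better channel score of the second candidate, which is the kind of trade-off the noisy channel formulation is meant to capture.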
One of the primary obstacles to making the statis- 
tical translation approach practical is slow speed of 
translation, as performed in A* fashion. This price 
is paid for the robustness that is obtained by using 
very flexible language and translation models. The 
language model allows sentences of arbitrary order
and the translation model allows arbitrary
word-order permutation. No structural constraints or
explicit linguistic grammars are imposed by this
model. 
The translation channel is characterized by two
sets of parameters: translation and alignment probabilities.¹
The translation probabilities describe lexical
substitution, while alignment probabilities describe
word-order permutation. The key problem
is that the formulation of alignment probabilities
$a(i \mid j, V, T)$ permits the English word in position j
of a length-T sentence to map to any position i of a
length-V Chinese sentence. So $V^T$ alignments are
possible, yielding an exponential space with
correspondingly slow search times.
¹ Various models have been constructed by the IBM team
(Brown et al., 1993). This description corresponds to one of the
simplest ones, "Model 2"; search costs for the more complex
models are correspondingly higher.
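To make the size of the alignment space concrete, the following sketch simply evaluates the count of V to the power T for a few sentence lengths. The function name and the choice of equal source and target lengths are assumptions for illustration.

```python
def num_alignments(V, T):
    """Each of the T English positions may map to any of the V Chinese positions."""
    return V ** T

# Equal-length sentence pairs of 5, 10 and 20 words (illustrative lengths only).
for length in (5, 10, 20):
    print(length, num_alignments(length, length))
```

Even at 20 words the count is far beyond what exhaustive enumeration could cover, which is why heuristic search was needed for this channel model.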
3 A SITG Channel Model 
The translation channel we propose is based on 
the recently introduced bilingual language model- 
ing approach. The model employs a stochastic ver- 
sion of an inversion transduction grammar or ITG 
(Wu, 1995c; Wu, 1995d; Wu, 1997). This formal- 
ism was originally developed for the purpose of par- 
allel corpus annotation, with applications for brack- 
eting, alignment, and segmentation. Subsequently, 
a method was developed to use a special case of the 
ITG, the aforementioned BTG, for the translation
task itself (Wu, 1996). The next few paragraphs 
briefly review the main properties of ITGs, before 
we describe the SITG channel. 
An ITG consists of context-free productions
where terminal symbols come in couples, for example
x/y, where x is an English word and y is a
Chinese translation of x, with singletons of the form
x/ε or ε/y representing function words that are used
in only one of the languages. Any parse tree thus
generates both English and Chinese strings simultaneously.
Thus, the tree:
(1) [I/我 [[took/拿了 [a/一 ε/本 book/书]NP ]VP [for/给 you/你]PP ]VP ]S
produces, for example, the mutual translations:
(2) a. [我 [[拿了 [一本书]NP ]VP [给你]PP ]VP ]S
    b. [I [[took [a book]NP ]VP [for you]PP ]VP ]S
An additional mechanism accommodates a con- 
servative degree of word-order variation between 
the two languages. With each production of the 
grammar is associated either a straight orientation 
or an inverted orientation, respectively denoted as 
follows:
VP → [VP PP]
VP → ⟨VP PP⟩
In the case of a production with straight orien- 
tation, the right-hand-side symbols are visited left- 
to-right for both the English and Chinese streams. 
But for a production with inverted orientation, the 
right-hand-side symbols are visited left-to-right for 
English and right-to-left for Chinese. Thus, the tree: 
(3) [I/我 ⟨[took/拿了 [a/一 ε/本 book/书]NP ]VP [for/给 you/你]PP⟩VP ]S
produces translations with different word order:
(4) a. [I [[took [a book]NP ]VP [for you]PP ]VP ]S
    b. [我 [[给你]PP [拿了 [一本书]NP ]VP ]VP ]S
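The effect of straight versus inverted orientation can be sketched with a small tree walker that reads off both output streams. The tuple-based tree encoding, the function name `yields`, and the Chinese glosses used in the toy tree are assumptions made for this illustration only.

```python
def yields(node):
    """Return (english_tokens, chinese_tokens) generated by an ITG parse tree.

    A leaf is a couple (english, chinese); an empty string marks a singleton side.
    An internal node is (orientation, children), with orientation "[]" or "<>".
    """
    if isinstance(node[1], str):           # leaf couple x/y
        e, c = node
        return ([e] if e else []), ([c] if c else [])
    orientation, children = node
    pairs = [yields(child) for child in children]
    english = [w for e, _ in pairs for w in e]
    chinese_parts = [c for _, c in pairs]
    if orientation == "<>":                # inverted: Chinese side is read right-to-left
        chinese_parts.reverse()
    chinese = [w for part in chinese_parts for w in part]
    return english, chinese

# Toy tree; the Chinese glosses are assumed for illustration.
np_ = ("[]", [("a", "一"), ("", "本"), ("book", "书")])
vp_inner = ("[]", [("took", "拿了"), np_])
pp = ("[]", [("for", "给"), ("you", "你")])
straight = ("[]", [("I", "我"), ("[]", [vp_inner, pp])])
inverted = ("[]", [("I", "我"), ("<>", [vp_inner, pp])])

print(yields(straight))
print(yields(inverted))
```

Both trees produce the same English string; only the order of the two Chinese subconstituents under the outer VP differs.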
The surprising ability of ITGs to accommodate
nearly all word-order variation between fixed-word-order
languages² (English and Chinese in particular)
has been analyzed mathematically, linguistically,
and experimentally (Wu, 1995b; Wu, 1997).
Any ITG can be transformed to an equivalent
binary-branching normal form.
² With the exception of higher-order phenomena such as
neg-raising and wh-movement.
A stochastic ITG associates a probability with 
each production. It follows that a SITG assigns 
a probability Pr(e,c,q) to all generable trees q 
and sentence-pairs. In principle it can be used as 
the translation channel model by normalizing with 
Pr(c) and integrating out Pr(q) to give Pr(e|c) in
Equation (1). In practice, a strong language model 
makes this unnecessary, so we can instead optimize 
the simpler Viterbi approximation 
$$c^* = \arg\max_c \Pr(e, c, q)\,\Pr(c) \qquad (2)$$
To complete the picture we add a bigram model
$g_{c_{j-1} c_j} = g(c_j \mid c_{j-1})$ for the Chinese language
model Pr(c).
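A bigram language model score of this kind can be sketched as a sum of log probabilities over adjacent word pairs. The dictionary layout, the toy probabilities, and the `floor` value for unseen pairs are assumptions for illustration; the paper does not specify any smoothing scheme.

```python
import math

def bigram_logprob(words, g, floor=1e-6):
    """Sum of log g(c_j | c_{j-1}) over adjacent pairs; unseen pairs get `floor`."""
    return sum(math.log(g.get((prev, cur), floor))
               for prev, cur in zip(words, words[1:]))

# Toy table (hypothetical values): (c_{j-1}, c_j) -> probability
g = {("我", "给"): 0.2, ("给", "你"): 0.5}
score = bigram_logprob(["我", "给", "你"], g)
print(score)
```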
This approach was used for the SBTG chan- 
nel (Wu, 1996), using the language-independent 
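bracketing degenerate case of the SITG:³
$$\begin{array}{llll}
A & \xrightarrow{a_{[]}} & [A\;A] \\
A & \xrightarrow{a_{\langle\rangle}} & \langle A\;A \rangle \\
A & \xrightarrow{b(x,y)} & x/y & \forall x, y \text{ lexical translations} \\
A & \xrightarrow{b(x,\epsilon)} & x/\epsilon & \forall x \text{ language 1 vocabulary} \\
A & \xrightarrow{b(\epsilon,y)} & \epsilon/y & \forall y \text{ language 2 vocabulary}
\end{array}$$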
In the proposed model, a structured language- 
dependent ITG is used instead. 
4 A Grammatical Channel Model 
Stated radically, our novel modeling thesis is that 
a mirrored version of the target language grammar 
can parse sentences of the source language. 
Ideally, an ITG would be tailored for the desired 
source and target languages, enumerating the trans- 
duction patterns specific to that language pair. Con- 
structing such an ITG, however, requires massive 
manual labor effort for each language pair. Instead, 
our approach is to take a more readily acquired 
monolingual context-free grammar for the target 
language, and use (or perhaps misuse) it in the SITG 
channel, by employing the three tactics described 
below: production mirroring, part-of-speech map- 
ping, and word skipping. 
In the following, keep in mind our convention 
that language 1 is the source (English), while lan- 
guage 2 is the target (Chinese). 
³ Wu (1996) experimented with Chinese-English translation,
while this paper experiments with English-Chinese
translation.
S → NP VP Punc
VP → V NP
NP → N Mod N | Prn

S → [NP VP Punc] | ⟨Punc VP NP⟩
VP → [V NP] | ⟨NP V⟩
NP → [N Mod N] | ⟨N Mod N⟩ | [Prn]
Figure 1: An input CFG and its mirrored ITG.
4.1 Production Mirroring 
The first step is to convert the monolingual Chinese
CFG to a bilingual ITG. The production mirroring
tactic simply doubles the number of productions,
transforming every monolingual production
into two bilingual productions,⁴ one straight
and one inverted, as for example in Figure 1 where
the upper Chinese CFG becomes the lower ITG.
The intent of the mirroring is to add enough flex- 
ibility to allow parsing of English sentences using 
the language 1 side of the ITG. The extra produc- 
tions accommodate reversed subconstituent order in 
the source language's constituents, at the same time 
restricting the language 2 output sentence to conform
to the given target grammar whether straight or
inverted productions are used. 
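Production mirroring as described above can be sketched in a few lines. The tuple representation of productions and the helper names `mirror` and `language2_order` are assumptions for illustration; the sketch follows the convention of Figure 1, where an inverted production is written with its right-hand side in language 1 order.

```python
def mirror(cfg):
    """Turn monolingual productions (lhs, rhs) into ITG productions (lhs, orientation, rhs).

    An inverted production lists its right-hand side reversed (language 1 order);
    reading it right-to-left for language 2 restores the original target order.
    """
    itg = []
    for lhs, rhs in cfg:
        itg.append((lhs, "[]", tuple(rhs)))
        if len(rhs) > 1:                       # unary productions yield only one
            itg.append((lhs, "<>", tuple(reversed(rhs))))
    return itg

def language2_order(production):
    """Right-hand side as visited on the language 2 (target) side."""
    _, orientation, rhs = production
    return tuple(reversed(rhs)) if orientation == "<>" else rhs

cfg = [("S", ("NP", "VP", "Punc")), ("VP", ("V", "NP")),
       ("NP", ("N", "Mod", "N")), ("NP", ("Prn",))]
itg = mirror(cfg)
for production in itg:
    print(production, "-> language 2 order:", language2_order(production))
```

Every mirrored production has a language 2 side identical to some original production, which is the property that keeps the output inside the target grammar.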
The following example illustrates how produc- 
tion mirroring works. Consider the input sentence 
He is the son of Stephen, which can be parsed by 
the ITG of Figure 1 to yield the corresponding out- 
put sentence ~~1~:~, with the following 
parse tree: 
(5) [[[He/{~]Prn]NP [[is/~]V [the/ε]NOISE
    ⟨[son/~]N [of/~]Mod [Stephen/~ff]N⟩NP]VP [./。]Punc]S
Production mirroring produced the inverted NP
constituent which was necessary to parse son of
Stephen, i.e., ⟨son/.~ of/flcJ Stephen/~⟩NP.
If the target CFG is purely binary branching, 
then the previous theoretical and linguistic analy- 
ses (Wu, 1997) suggest that much of the requisite 
constituent and word order transposition may be ac- 
commodated without change to the mirrored ITG. 
On the other hand, if the target CFG contains pro- 
ductions with long right-hand-sides, then merely in- 
verting the subconstituent order will probably be in- 
sufficient. In such cases, a more complex transfor- 
mation heuristic would be needed. 
⁴ Except for unary productions, which yield only one bilingual
production.
Objective 3 (improving grammaticality of the
output) can be directly tackled by using a tight target
grammar. To see this, consider using a mirrored
Chinese CFG to parse English sentences with
the language 1 side of the ITG. Any resulting parse 
tree must be consistent with the original Chinese 
grammar. This follows from the fact that both the 
straight and inverted versions of a production have 
language 2 (Chinese) sides identical to the original 
monolingual production: inverting production ori- 
entation cancels out the mirroring of the right-hand- 
side symbols. Thus, the output grammaticality de- 
pends directly on the tightness of the original Chi- 
nese grammar. 
In principle, with this approach a single tar- 
get grammar could be used for translation from 
any number of other (fixed word-order) source lan- 
guages, so long as a translation lexicon is available 
for each source language. 
Probabilities on the mirrored ITG cannot be re- 
liably estimated from bilingual data without a very 
large parallel corpus. A straightforward approxima- 
tion is to employ EM or Viterbi training on just a 
monolingual target language (Chinese) corpus. 
4.2 Part-of-Speech Mapping 
The second problem is that the part-of-speech (PoS) 
categories used by the target (Chinese) grammar do 
not correspond to the source (English) words when 
the source sentence is parsed. It is unlikely that any 
English lexicon will list Chinese parts-of-speech. 
We employ a simple part-of-speech mapping 
technique that allows the PoS tag of any corre- 
sponding word in the target language (as found in 
the translation lexicon) to serve as a proxy for the 
source word's PoS. The word view, for example, 
may be tagged with the Chinese tags nc and vn, 
since the translation lexicon holds both view_NN/~ ~_nc
and view_VB/~_vn.
Unknown English words must be handled differ- 
ently since they cannot be looked up in the transla- 
tion lexicon. The English PoS tag is first found by 
tagging the English sentence. A set of possible cor- 
responding Chinese PoS tags is then found by table 
lookup (using a small hand-constructed mapping ta- 
ble). For example, NN may map to nc, loc and pref, 
while VB may map to vi, vn, vp, vv, vs, etc. This 
method generates many hypotheses and should only 
be used as a last resort. 
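The two-stage lookup described in this section can be sketched as follows. The function name, the data layout, and the placeholder strings standing in for Chinese words are assumptions for illustration; the fallback table lists only the tags named in the text and is not the full hand-constructed table.

```python
def chinese_tags(word, english_tag, lexicon, fallback):
    """Candidate Chinese PoS tags for an English word.

    lexicon:  English word -> list of (chinese_word, chinese_tag) translations
    fallback: English PoS tag -> list of Chinese PoS tags (hand-built table)
    """
    entries = lexicon.get(word)
    if entries:                                   # known word: borrow translations' tags
        return sorted({tag for _, tag in entries})
    return list(fallback.get(english_tag, []))    # unknown word: last-resort table lookup

# Placeholder Chinese words; only the tags matter for this sketch.
lexicon = {"view": [("zh_noun", "nc"), ("zh_verb", "vn")]}
fallback = {"NN": ["nc", "loc", "pref"], "VB": ["vi", "vn", "vp", "vv", "vs"]}

print(chinese_tags("view", "NN", lexicon, fallback))
print(chinese_tags("unknownword", "NN", lexicon, fallback))
```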
4.3 Word Skipping 
Regardless of how constituent-order transposition is 
handled, some function words simply do not oc- 
cur in both languages, for example Chinese aspect 
markers. This is the rationale for the singletons 
mentioned in Section 3. 
If we create an explicit singleton hypothesis for 
every possible input word, the resulting search 
space will be too large. To recognize singletons, we 
instead borrow the word-skipping technique from 
speech recognition and robust parsing. As formalized
in the next section, we do this by modifying
the item extension step in our chart-parser-like
algorithm. When the dot of an item is at the rightmost
position, the item is a complete constituent (a subtree)
and can be used to extend other items. In ordinary chart
parsing, the subtrees that can extend an item are those
that begin immediately to the right of the item's dot
position and whose category equals the category
anticipated by the item.
With word-skipping, a valid subtree may instead
be located a few positions to the right of the dot
position (or to the left, for an item corresponding to
an inverted production). In other words, the words
between the dot position and the start of the subtree are
skipped, and are considered to be singletons.
Consider Sentence 5 again. Word-skipping handled
the word "the", which has no Chinese counterpart. At a
certain point during translation, we have the following
item: VP → [is/x~]V • NP. With word-skipping,
it can be extended to VP → [is/x~]V NP • by the subtree
⟨son/~ of/~ Stephen/~⟩NP, even though the
subtree is not adjacent to the dot position of the item
(but is within a certain distance of it; see Section 5).
The word "the", which is adjacent to the dot position of
the item, is skipped.
Word-skipping gives us the flexibility to parse
the source input by skipping possible singletons
whenever doing so allows the source input to be parsed
with the highest likelihood and grammatical output
to be produced.
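The adjacency test that word-skipping relaxes can be sketched as a simple distance check. The function names, the position convention (position k lies after the k-th token), and the value of K used in the example are assumptions for illustration.

```python
def can_extend(dot, subtree_edge, K, inverted=False):
    """True if a subtree may extend an item whose dot is at position `dot`.

    For straight items `subtree_edge` is where the subtree starts (to the right
    of the dot); for inverted items it is where the subtree ends (to the left).
    At most K source words may be skipped in between.
    """
    gap = dot - subtree_edge if inverted else subtree_edge - dot
    return 0 <= gap <= K

def skipped_words(tokens, dot, subtree_start):
    """Source words between the dot and the subtree, treated as singletons."""
    return tokens[dot:subtree_start]

tokens = ["He", "is", "the", "son", "of", "Stephen", "."]
# The item VP -> [is]V . NP has its dot after "is" (position 2);
# the NP subtree "son of Stephen" starts at position 3.
print(can_extend(2, 3, K=4), skipped_words(tokens, 2, 3))
```

With K set to 0 the check reduces to the ordinary chart-parsing requirement that the subtree be adjacent to the dot.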
5 Translation Algorithm 
The translation search algorithm differs from that of 
Wu's SBTG model in that it handles arbitrary gram- 
mars rather than binary bracketing grammars. As 
such it is more similar to active chart parsing (Earley,
1970) than to CYK parsing (Kasami, 1965;
Younger, 1967). We take the standard notion of
items (Aho and Ullman, 1972), and use the term
anticipation to mean an item which still has symbols
right of its dot. Items that do not have any symbols
right of the dot are called subtrees.
As with Wu's SBTG model, the algorithm max- 
imizes a probabilistic objective function, Equa- 
tion (2), using dynamic programming similar to that 
for HMM recognition (Viterbi, 1967). The presence 
of the bigram model in the objective function ne- 
cessitates indexes in the recurrence not only on sub- 
trees over the source English string, but also on the 
delimiting words of the target Chinese substrings. 
The dynamic programming exploits a recursive 
formulation of the objective function as follows. 
Some notation remarks: $e_{s..t}$ denotes the subsequence
of English tokens $e_{s+1}, e_{s+2}, \ldots, e_t$. We
use $C(s..t)$ to denote the set of Chinese words that
are translations of the English word created by taking
all tokens in $e_{s..t}$ together. $C(s, t)$ denotes the
set of Chinese words that are translations of any of
the English words anywhere within $e_{s..t}$. $K$ is the
maximum number of consecutive English words
that can be skipped.⁵ Finally, the argmax operator is
generalized to vector notation to accommodate multiple
indices.
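1. Initialization
$$\delta^{0}_{rstYY} = b_i(e_{s..t}/Y), \qquad 0 \le s < t \le T, \quad Y \in C(s..t), \quad r \text{ is } Y\text{'s PoS}$$
2. Recursion
For all r, s, t, u, v such that r is the category of a constituent spanning s to t,
$0 \le s < t \le T$, and u, v are the leftmost/rightmost words of the constituent:
$$\delta_{rstuv} = \max\left[\delta^{[]}_{rstuv},\ \delta^{\langle\rangle}_{rstuv},\ \delta^{0}_{rstuv}\right]$$
$$\theta_{rstuv} = \begin{cases}
[] & \text{if } \delta^{[]}_{rstuv} > \delta^{\langle\rangle}_{rstuv} \text{ and } \delta^{[]}_{rstuv} > \delta^{0}_{rstuv} \\
\langle\rangle & \text{if } \delta^{\langle\rangle}_{rstuv} > \delta^{[]}_{rstuv} \text{ and } \delta^{\langle\rangle}_{rstuv} > \delta^{0}_{rstuv} \\
0 & \text{otherwise}
\end{cases}$$
where⁶
$$\delta^{[]}_{rstuv} = \max_{\substack{r \to [r_0 \cdots r_n] \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_i u_{i+1}}$$
$$\tau^{[]}_{rstuv} = \mathop{\arg\max}_{\substack{r \to [r_0 \cdots r_n] \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_i u_{i+1}}$$
$$\delta^{\langle\rangle}_{rstuv} = \max_{\substack{r \to \langle r_0 \cdots r_n \rangle \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_{i+1} u_i}$$
$$\tau^{\langle\rangle}_{rstuv} = \mathop{\arg\max}_{\substack{r \to \langle r_0 \cdots r_n \rangle \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_{i+1} u_i}$$
⁵ In our experiments, K was set to 4.
⁶ $s_0 = s$, $t_n = t$, $u_0 = u$, $v_n = v$, $g_{v_n u_{n+1}} = g_{v_{n+1} u_n} = 1$, $q_i = (r_i s_i t_i u_i v_i)$.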
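3. Reconstruction
Let $q_0 = (S, 0, T, u, v)$ be the optimal root, where
$(u, v) = \arg\max_{u, v \in C(0, T)} \delta_{S0Tuv}$. Any child of
$q = (r, s, t, u, v)$ is given by:
$$\mathrm{CHILD}(q, r) = \begin{cases}
\tau^{[]}_{rstuv} & \text{if } \theta_q = [] \\
\tau^{\langle\rangle}_{rstuv} & \text{if } \theta_q = \langle\rangle \\
\mathrm{NIL} & \text{otherwise}
\end{cases}$$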
Assuming the number of translations per word is
bounded by some constant, the maximum size
of C(s, t) is proportional to t - s. The asymptotic
time complexity for our algorithm is thus bounded
by $O(T^7)$. However, note that in theory the
complexity upper bound rises exponentially rather than
polynomially with the size of the grammar, just 
as for context-free parsing (Barton et al., 1987), 
whereas this is not a problem for Wu's SBTG algo- 
rithm. In practice, natural language grammars are 
usually sufficiently constrained so that speed is ac- 
tually improved over the SBTG algorithm, as dis- 
cussed later. 
The dynamic programming is efficiently im- 
plemented by an active-chart-parser-style agenda- 
based algorithm, sketched as follows: 
1. Initialization For each word in the input sentence, put a
subtree with category equal to the PoS of its translation
into the agenda.
2. Recursion Loop while agenda is not empty:
(a) If the current item is a subtree of category X, extend
existing anticipations by calling ANTICIPATIONEXTENSION.
For each rule in the grammar of the form
Z → X W ... Y, add an initial anticipation of
the form Z → X • W ... Y and put it into the
agenda. Add subtree X to the chart.
(b) If the current item is an anticipation of the form
Z → W ... • X ... Y from s0 to t0, find all subtrees
in the chart with category X that start at position t0
and use each subtree to extend this anticipation by
calling ANTICIPATIONEXTENSION.
ANTICIPATIONEXTENSION: Assuming the subtree we
found is of category X from position s1 to t, for any
anticipation of the form Z → W ... • X ... Y from s0
to [s1 - K, s1], extend it to Z → W ... X • ... Y with
span from s0 to t and add it to the agenda.
3. Reconstruction The output string is recursively reconstructed
from the highest-likelihood subtree with category S
that spans the whole input sentence.
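The control flow of the agenda-based algorithm can be sketched as a bare recognizer. This is a simplified illustration under stated assumptions, not the authors' implementation: it drops probabilities, the bigram model, inverted productions and output reconstruction, works on one set of candidate categories per input word, and uses hypothetical names (`recognize`, the `"sub"`/`"ant"` item tags) throughout.

```python
from collections import deque

def recognize(tags, grammar, K=0, goal="S"):
    """Agenda-driven chart recognizer with word skipping (at most K words per gap).

    tags:    one set of candidate categories per input word
    grammar: list of (lhs, rhs) with rhs a tuple of categories
    """
    n = len(tags)
    agenda = deque(("sub", cat, i, i + 1)
                   for i, cats in enumerate(tags) for cat in sorted(cats))
    subtrees, anticipations, seen = set(), set(), set(agenda)

    def push(item):
        if item not in seen:
            seen.add(item)
            agenda.append(item)

    def extend(ant, sub):                     # plays the role of ANTICIPATIONEXTENSION
        _, lhs, rhs, dot, s0, t0 = ant
        _, cat, s1, t1 = sub
        if rhs[dot] == cat and 0 <= s1 - t0 <= K:
            if dot + 1 == len(rhs):
                push(("sub", lhs, s0, t1))    # dot reached the right end: a subtree
            else:
                push(("ant", lhs, rhs, dot + 1, s0, t1))

    while agenda:
        item = agenda.popleft()
        if item[0] == "sub":
            subtrees.add(item)
            _, cat, s, t = item
            for ant in list(anticipations):
                extend(ant, item)
            for lhs, rhs in grammar:          # start new items whose first symbol matches
                if rhs[0] == cat:
                    push(("sub", lhs, s, t) if len(rhs) == 1
                         else ("ant", lhs, rhs, 1, s, t))
        else:
            anticipations.add(item)
            for sub in list(subtrees):
                extend(item, sub)
    return ("sub", goal, 0, n) in subtrees

grammar = [("S", ("NP", "VP", "Punc")), ("VP", ("V", "NP")),
           ("NP", ("N", "Mod", "N")), ("NP", ("Prn",))]
# "He is the son of Stephen ." with one assumed category per word.
tags = [{"Prn"}, {"V"}, {"NOISE"}, {"N"}, {"Mod"}, {"N"}, {"Punc"}]
print(recognize(tags, grammar, K=1), recognize(tags, grammar, K=0))
```

With K set to 1 the word "the" can be skipped and the sentence is accepted; with K set to 0 the same input is rejected because the NP no longer sits next to the verb.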
6 Results 
The grammatical channel was tested in the SILC 
translation system. The translation lexicon was 
partly constructed by training on government tran- 
scripts from the HKUST English-Chinese Paral- 
lel Bilingual Corpus, and partly entered by hand. 
The corpus was sentence-aligned statistically (Wu, 
1994); Chinese words and collocations were ex- 
tracted (Fung and Wu, 1994; Wu and Fung, 1994); 
then translation pairs were learned via an EM pro- 
cedure (Wu and Xia, 1995). Together with hand- 
constructed entries, the resulting English vocabu- 
lary is approximately 9,500 words and the Chinese 
vocabulary is approximately 14,500 words, with a 
many-to-many translation mapping averaging 2.56 
Chinese translations per English word. Since the 
lexicon's content is mixed, we approximate transla- 
tion probabilities by using the unigram distribution 
of the target vocabulary from a small monolingual 
corpus. Noise still exists in the lexicon. 
The Chinese grammar we used is not tight-- 
it was written for robust parsing purposes, and as 
such it over-generates. Because of this we have not 
yet been able to conduct a fair quantitative assess- 
ment of objective 3. Our productions were con- 
structed with reference to a standard grammar (Bei- 
jing Language and Culture Univ., 1996) and totalled 
316 productions. Not all the original productions 
are mirrored, since some (128) are unary produc- 
tions, and others are Chinese-specific lexical con- 
structions like S ~ ~-~ S NP ~ S, which are 
obviously unnecessary to handle English. About 
27.7% of the non-unary Chinese productions were 
mirrored and the total number of productions in the 
final ITG is 368. 
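The production counts reported above can be cross-checked with a few lines of arithmetic. The sketch assumes that each mirrored production contributes exactly one additional (inverted) production to the final ITG; the variable names are illustrative.

```python
total_cfg = 316      # productions in the Chinese CFG
unary = 128          # unary productions, which are never mirrored
final_itg = 368      # productions in the final ITG

non_unary = total_cfg - unary            # non-unary productions in the CFG
mirrored = final_itg - total_cfg         # inverted productions added by mirroring
mirrored_share = mirrored / non_unary    # fraction of non-unary productions mirrored

print(non_unary, mirrored, round(100 * mirrored_share, 1))
```

Under that assumption the counts are mutually consistent with the reported share of about 27.7%.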
For the experiment, 222 English sentences with 
a maximum length of 20 words from the parallel 
corpus were randomly selected. Some examples of 
the output are shown in Figure 2. No morphological 
processing has been used to correct the output, and 
up to now we have only been testing with a bigram 
model trained on an extremely small corpus.
With respect to objective 1 (increasing translation
speed), the new model is very encouraging. Table 1
shows that over 90% of the samples can be
processed within one minute by the grammatical
channel model, whereas that for the SBTG channel
model is about 50%. This demonstrates the stronger
constraints on the search space given by the SITG.

Time (x)                 SBTG Channel   Grammatical Channel
x < 30 secs.             15.6%          83.3%
30 secs. < x < 1 min.    34.9%          7.6%
x > 1 min.               49.5%          9.1%
Table 1: Translation speed.

Sentence meaning preservation   SBTG Channel   Grammatical Channel
Correct                         25.9%          32.3%
Incorrect                       74.1%          67.7%
Table 2: Translation accuracy.

The natural trade-off is that constraining the 
structure of the input decreases robustness some- 
what. Approximately 13% of the test corpus could 
not be parsed in the grammatical channel model. 
As mentioned earlier, this figure is likely to vary 
widely depending on the characteristics of the tar- 
get grammar. Of course, one can simply back off 
to the SBTG model when the grammatical channel 
rejects an input sentence. 
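The back-off strategy mentioned here can be sketched as a thin wrapper around two translation functions. Both callables and the convention that the grammatical channel returns `None` for an unparsable sentence are assumptions for illustration; the stubs below stand in for real systems.

```python
def translate_with_backoff(sentence, grammatical_channel, sbtg_channel):
    """Try the grammatical channel first; fall back to the SBTG channel on rejection.

    The grammatical channel is assumed to return None when it cannot parse the input.
    """
    output = grammatical_channel(sentence)
    if output is not None:
        return output, "grammatical"
    return sbtg_channel(sentence), "sbtg"

# Stub channels (hypothetical): the grammatical one rejects inputs over three words.
def grammatical_stub(sentence):
    return "G(" + sentence + ")" if len(sentence.split()) <= 3 else None

def sbtg_stub(sentence):
    return "B(" + sentence + ")"

print(translate_with_backoff("He is here", grammatical_stub, sbtg_stub))
print(translate_with_backoff("He is the son of Stephen", grammatical_stub, sbtg_stub))
```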
With respect to objective 2 (improving meaning- 
preservation accuracy), the new model is also 
promising. Table 2 shows that the percentage of 
meaningfully translated sentences rises from 26% to 
32% (ignoring the rejected cases).⁷ We have judged
only whether the correct meaning is conveyed by the 
translation, paying particular attention to word order 
and grammaticality, but otherwise ignoring morpho- 
logical and function word choices. 
7 Conclusion 
Currently we are designing a tight generation- 
oriented Chinese grammar to replace our robust 
parsing-oriented grammar. We will use the new 
grammar to quantitatively evaluate objective 3. We 
are also studying complementary approaches to 
the English word deletion performed by word- 
skipping--i.e., extensions that insert Chinese words 
suggested by the target grammar into the output. 
⁷ These accuracy rates are relatively low because these experiments
are being conducted with new lexicons and grammar
on a new translation direction (English-Chinese).
The framework seeds a natural transition toward
pattern-based translation models (objective 4). One
can post-edit the productions of a mirrored SITG
more carefully and extensively than we have done 
in our cursory pruning, gradually transforming the 
original monolingual productions into a set of true 
transduction rule patterns. This provides a smooth 
evolution from a purely statistical model toward a 
hybrid model, as more linguistic resources become 
available. 
We have described a new stochastic grammati- 
cal channel model for statistical machine translation 
that exhibits several nice properties in comparison 
with Wu's SBTG model and IBM's word alignment 
model. The SITG-based channel increases trans- 
lation speed, improves meaning-preservation accu- 
racy, permits tight target CFGs to be incorporated 
for improving output grammaticality, and suggests 
a natural evolution toward transduction rule mod- 
els. The input CFG is adapted for use via produc- 
tion mirroring, part-of-speech mapping, and word- 
skipping. We gave a polynomial-time translation 
algorithm that requires only a translation lexicon, 
plus a CFG and bigram language model for the tar- 
get language. More linguistic knowledge about the 
target language is employed than in pure statisti- 
cal translation models, but Wu's SBTG polynomial- 
time bound on search cost is retained and in fact the 
search space can be significantly reduced by using 
a good grammar. Output always conforms to the 
given target grammar. 
Acknowledgments 
Thanks to the SILC group members: Xuanyin Xia, Daniel 
Chan, Aboy Wong, Vincent Chow & James Pang. 

References 
Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing,
Translation, and Compiling. Prentice Hall, Englewood Cliffs, NJ.
G. Edward Barton, Robert C. Berwick, and Eric S. Ristad. 1987.
Computational Complexity and Natural Language. MIT Press,
Cambridge, MA.
Beijing Language and Culture Univ. 1996. Sucheng Hanyu Chuji
Jiaocheng (A Short Intensive Elementary Chinese Course), volume
1-4. Beijing Language and Culture Univ. Press.
Peter F. Brown, John Cocke, Stephen A. DellaPietra, Vincent J.
DellaPietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and
Paul S. Roossin. 1990. A statistical approach to machine
translation. Computational Linguistics, 16(2):79-85.
Peter F. Brown, Stephen A. DellaPietra, Vincent J. DellaPietra, and
Robert L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computational
Linguistics, 19(2):263-311.
Jay Earley. 1970. An efficient context-free parsing algorithm.
Communications of the Assoc. for Computing Machinery, 13(2):94-102.
Pascale Fung and Dekai Wu. 1994. Statistical augmentation of a
Chinese machine-readable dictionary. In Proc. of the 2nd Annual
Workshop on Very Large Corpora, pg 69-85, Kyoto, Aug.
T. Kasami. 1965. An efficient recognition and syntax analysis
algorithm for context-free languages. Technical Report AFCRL-65-758,
Air Force Cambridge Research Lab., Bedford, MA.
Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an
asymptotically optimal decoding algorithm. IEEE Transactions on
Information Theory, 13:260-269.
Dekai Wu and Pascale Fung. 1994. Improving Chinese tokenization
with linguistic filters on statistical lexical acquisition. In Proc. of
4th Conf. on ANLP, pg 180-181, Stuttgart, Oct.
Dekai Wu and Xuanyin Xia. 1995. Large-scale automatic extraction
of an English-Chinese lexicon. Machine Translation, 9(3-4):285-313.
Dekai Wu. 1994. Aligning a parallel English-Chinese corpus
statistically with lexical criteria. In Proc. of 32nd Annual Conf. of Assoc.
for Computational Linguistics, pg 80-87, Las Cruces, Jun.
Dekai Wu. 1995a. An algorithm for simultaneously bracketing parallel
texts by aligning words. In Proc. of 33rd Annual Conf. of Assoc. for
Computational Linguistics, pg 244-251, Cambridge, MA, Jun.
Dekai Wu. 1995b. Grammarless extraction of phrasal translation
examples from parallel texts. In TMI-95, Proc. of the 6th Intl Conf.
on Theoretical and Methodological Issues in Machine Translation,
volume 2, pg 354-372, Leuven, Belgium, Jul.
Dekai Wu. 1995c. Stochastic inversion transduction grammars, with
application to segmentation, bracketing, and alignment of parallel
corpora. In Proc. of IJCAI-95, 14th Intl Joint Conf. on Artificial
Intelligence, pg 1328-1334, Montreal, Aug.
Dekai Wu. 1995d. Trainable coarse bilingual grammars for parallel
text bracketing. In Proc. of the 3rd Annual Workshop on Very Large
Corpora, pg 69-81, Cambridge, MA, Jun.
Dekai Wu. 1996. A polynomial-time algorithm for statistical machine
translation. In Proc. of the 34th Annual Conf. of the Assoc. for
Computational Linguistics, pg 152-158, Santa Cruz, CA, Jun.
Dekai Wu. 1997. Stochastic inversion transduction grammars and
bilingual parsing of parallel corpora. Computational Linguistics,
23(3):377-404, Sept.
David H. Younger. 1967. Recognition and parsing of context-free
languages in time n^3. Information and Control, 10(2):189-208.
