Proceedings of the Workshop on Statistical Machine Translation, pages 23–30,
New York City, June 2006. ©2006 Association for Computational Linguistics
Quasi-Synchronous Grammars:
Alignment by Soft Projection of Syntactic Dependencies
David A. Smith and Jason Eisner
Department of Computer Science
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD 21218, USA
{dasmith,eisner}@jhu.edu
Abstract
Many syntactic models in machine trans-
lation are channels that transform one
tree into another, or synchronous gram-
mars that generate trees in parallel. We
present a new model of the translation pro-
cess: quasi-synchronous grammar (QG).
Given a source-language parse tree T1, a
QG defines a monolingual grammar that
generates translations of T1. The trees
T2 allowed by this monolingual gram-
mar are inspired by pieces of substruc-
ture in T1 and aligned to T1 at those
points. We describe experiments learning
quasi-synchronous context-free grammars
from bitext. As with other monolingual
language models, we evaluate the cross-
entropy of QGs on unseen text and show
that a better fit to bilingual data is achieved
by allowing greater syntactic divergence.
When evaluated on a word alignment task,
QG matches standard baselines.
1 Motivation and Related Work
1.1 Sloppy Syntactic Alignment
This paper proposes a new type of syntax-based
model for machine translation and alignment. The
goal is to make use of syntactic formalisms, such as
context-free grammar or tree-substitution grammar,
without being overly constrained by them.
Let S1 and S2 denote the source and target sen-
tences. We seek to model the conditional probability
p(T2,A | T1) (1)
where T1 is a parse tree for S1, T2 is a parse tree
for S2, and A is a node-to-node alignment between
them. This model allows one to carry out a variety
of alignment and decoding tasks. Given T1, one can
translate it by finding the T2 and A that maximize
(1). Given T1 and T2, one can align them by finding
the A that maximizes (1) (equivalent to maximizing
p(A | T2,T1)). Similarly, one can align S1 and S2
by finding the parses T1 and T2, and alignment A,
that maximize p(T2,A | T1) · p(T1 | S1), where
p(T1 | S1) is given by a monolingual parser. We
usually accomplish such maximizations by dynamic
programming.
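For concreteness, the following sketch spells out these three maximizations in Python; the enumerators and the scorer are hypothetical stand-ins for the dynamic programs, and none of the names come from our implementation.

```python
# Hypothetical interfaces: target_parses(t1, s2=None) yields (T2, A) pairs,
# alignments(t1, t2) yields node-to-node alignments, parses(s1) yields
# (T1, p(T1 | S1)) pairs, and qg_score(t2, a, t1) evaluates p(T2, A | T1).

def translate(t1, target_parses, qg_score):
    # argmax over (T2, A) of p(T2, A | T1)
    return max(target_parses(t1), key=lambda ta: qg_score(ta[0], ta[1], t1))

def align_trees(t1, t2, alignments, qg_score):
    # argmax over A of p(T2, A | T1), equivalently of p(A | T2, T1)
    return max(alignments(t1, t2), key=lambda a: qg_score(t2, a, t1))

def align_sentences(s1, s2, parses, target_parses, qg_score):
    # argmax over (T1, T2, A) of p(T2, A | T1) * p(T1 | S1)
    best, best_score = None, float("-inf")
    for t1, p1 in parses(s1):
        for t2, a in target_parses(t1, s2):
            score = qg_score(t2, a, t1) * p1
            if score > best_score:
                best, best_score = (t1, t2, a), score
    return best
```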
Equation (1) does not assume that T1 and T2 are
isomorphic. For example, a model might judge T2
and A to be likely, given T1, provided that many—
but not necessarily all—of the syntactic dependen-
cies in T1 are aligned with corresponding depen-
dencies in T2. Hwa et al. (2002) found that hu-
man translations from Chinese to English preserved
only 39–42% of the unlabeled Chinese dependen-
cies. They increased this figure to 67% by using
more involved heuristics for aligning dependencies
across these two languages. That suggests that (1)
should be defined to consider more than one depen-
dency at a time.
This inspires the key novel feature of our models:
A does not have to be a “well-behaved” syntactic
alignment. Any portion of T2 can align to any por-
tion of T1, or to NULL. Nodes that are syntactically
related in T1 do not have to translate into nodes that
are syntactically related in T2—although (1) is usu-
ally higher if they do.
This property makes our approach especially
promising for aligning freely, or erroneously, trans-
lated sentences, and for coping with syntactic diver-
gences observed between even closely related lan-
guages (Dorr, 1994; Fox, 2002). We can patch to-
gether an alignment without accounting for all the
details of the translation process. For instance, per-
haps a source NP (figure 1) or PP (figure 2) appears
“out of place” in the target sentence. A linguist
might account for the position of the PP auf diese
Frage either syntactically (by invoking scrambling)
or semantically (by describing a deep analysis-
transfer-synthesis process in the translator’s head).
But an MT researcher may not have the wherewithal
to design, adequately train, and efficiently compute
with “deep” accounts of this sort. Under our ap-
proach, it is possible to use a simple, tractable syn-
tactic model, but with some contextual probability
of “sloppy” transfer.
1.2 From Synchronous to Quasi-Synchronous
Grammars
Because our approach will let anything align to
anything, it is reminiscent of IBM Models 1–5
(Brown et al., 1993). It differs from the many ap-
proaches where (1) is defined by a stochastic syn-
chronous grammar (Wu, 1997; Alshawi et al., 2000;
Yamada and Knight, 2001; Eisner, 2003; Gildea,
2003; Melamed, 2004) and from transfer-based sys-
tems defined by context-free grammars (Lavie et al.,
2003).
The synchronous grammar approach, originally
due to Shieber and Schabes (1990), supposes that T2
is generated in lockstep with T1.¹ When choosing how
to expand a certain VP node in T2, a synchronous
CFG process would observe that this node is aligned
to a node VP′ in T1, which had been expanded in T1
by VP′ → NP′ V′. This might bias it toward choos-
ing to expand the VP in T2 as VP → V NP, with the
new children V aligned to V′ and NP aligned to NP′.
The process then continues recursively by choosing
moves to expand these children.
One can regard this stochastic process as an in-
stance of analysis-transfer-synthesis MT. Analysis
chooses a parse T1 given S1. Transfer maps the
context-free rules in T1 to rules of T2. Synthesis
¹The usual presentation describes a process that generates
T1 and T2 jointly, leading to a joint model p(T2,A,T1). Divid-
ing by the marginal p(T1) gives a conditional model p(T2,A |
T1) as in (1). In the text, we directly describe an equivalent
conditional process for generating T2,A given T1.
deterministically assembles the latter rules into an
actual tree T2 and reads off its yield S2.
What is worrisome about the synchronous pro-
cess is that it can only produce trees T2 that are
perfectly isomorphic to T1. It is possible to relax
this requirement by using synchronous grammar for-
malisms more sophisticated than CFG:² one can per-
mit unaligned nodes (Yamada and Knight, 2001),
duplicated children (Gildea, 2003)³, or alignment
between elementary trees of differing sizes rather
than between single rules (Eisner, 2003; Ding and
Palmer, 2005; Quirk et al., 2005). However, one
would need rather powerful and slow grammar for-
malisms (Shieber and Schabes, 1990; Melamed et
al., 2004), often with discontiguous constituents, to
account for all the linguistic divergences that could
arise from different movement patterns (scrambling,
wh-in situ) or free translation. In particular, a syn-
chronous grammar cannot practically allow S2 to be
any permutation of S1, as IBM Models 1–5 do.
Our alternative is to define a “quasi-synchronous”
stochastic process. It generates T2 in a way that is
not in thrall to T1 but is “inspired by it.” (A human
translator might be imagined to behave similarly.)
When choosing how to expand nodes of T2, we are
influenced both by the structure of T1 and by mono-
lingual preferences about the structure of T2. Just as
conditional Markov models can more easily incor-
porate global features than HMMs, we can look at
the entire tree T1 at every stage in generating T2.
2 Quasi-Synchronous Grammar
Given an input S1 or its parse T1, a quasi-
synchronous grammar (QG) constructs a monolin-
gual grammar for parsing, or generating, the possi-
ble translations S2—that is, a grammar for finding
appropriate trees T2. What ties this target-language
grammar to the source-language input? The gram-
mar provides for target-language words to take on
²When one moves beyond CFG, the derived trees T1 and
T2 are still produced from a single derivation tree, but may be
shaped differently from the derivation tree and from each other.
³For tree-to-tree alignment, Gildea proposed a clone opera-
tion that allowed subtrees of the source tree to be reused in gen-
erating a target tree. In order to preserve dynamic programming
constraints, the identity of the cloned subtree is chosen indepen-
dently of its insertion point. This breakage of monotonic tree
alignment moves Gildea’s alignment model from synchronous
to quasi-synchronous.
[Figure 1 graphic: the English dependency tree for "Then we could deal with Chernobyl some time later ." and the German dependency tree for "Tschernobyl könnte dann etwas später an die Reihe kommen .", with POS tags and aligned word indices on each node.]
Figure 1: German and English dependency parses and their alignments from our system where German
is the target language. Tschernobyl depends on könnte even though their English analogues are not in a
dependency relationship. Note the parser's error in not attaching etwas to später.
German: Tschernobyl könnte dann etwas später an die Reihe kommen .
Literally: Chernobyl could then somewhat later on the queue come.
English: Then we could deal with Chernobyl some time later .
[Figure 2 graphic: the English dependency tree for "I did not unfortunately receive an answer to this question ." and the German dependency tree for "Auf diese Frage habe ich leider keine Antwort bekommen .", with POS tags and aligned word indices on each node.]
Figure 2: Here the German sentence exhibits scrambling of the phrase auf diese Frage and negates the object
of bekommen instead of the verb itself.
German: Auf diese Frage habe ich leider keine Antwort bekommen .
Literally: To this question have I unfortunately no answer received.
English: I did not unfortunately receive an answer to this question .
multiple hidden “senses,” which correspond to (pos-
sibly empty sets of) word tokens in S1 or nodes in
T1. To take a familiar example, when parsing the
English side of a French-English bitext, the word
bank might have the sense banque (financial) in one
sentence and rive (littoral) in another.
The QG⁴ considers the “sense” of the former bank
token to be a pointer to the particular banque token
to which it aligns. Thus, a particular assignment of
S1 “senses” to word tokens in S2 encodes a word
alignment.
Now, selectional preferences in the monolingual
grammar can be influenced by these T1-specific
senses. So they can encode preferences for how T2
ought to copy the syntactic structure of T1. For ex-
ample, if T1 contains the phrase banque nationale,
then the QG for generating a corresponding T2 may
encourage any T2 English noun whose sense is
banque (more precisely, T1’s token of banque) to
generate an adjectival English modifier with sense
nationale. The exact probability of this, as well as
the likely identity and position of that English mod-
ifier (e.g., national bank), may also be influenced by
monolingual facts about English.
2.1 Definition
A quasi-synchronous grammar is a monolingual
grammar that generates translations of a source-
language sentence. Each state of this monolingual
grammar is annotated with a “sense”—a set of zero
or more nodes from the source tree or forest.
For example, consider a quasi-synchronous
context-free grammar (QCFG) for generating trans-
lations of a source tree T1. The QCFG generates the
target sentence using nonterminals from the cross
product U × 2^V1, where U is the set of monolingual
target-language nonterminals such as NP, and V1 is
the set of nodes in T1.
Thus, a binarized QCFG has rules of the form
〈A,α〉 → 〈B,β〉〈C,γ〉 (2)
〈A,α〉 → w (3)
where A,B,C ∈ U are ordinary target-language
nonterminals, α, β, γ ∈ 2^V1 are sets of source tree
⁴By abuse of terminology, we often use “QG” to refer to the
T1-specific monolingual grammar, although the QG is properly
a recipe for constructing such a grammar from any input T1.
nodes to which A,B,C respectively align, and w is
a target-language terminal.
Similarly, a quasi-synchronous tree-substitution
grammar (QTSG) annotates the root and frontier
nodes of its elementary trees with sets of source
nodes from 2^V1.
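To make the state space concrete, here is a minimal Python sketch of QCFG states and rules; the class and field names are ours, not part of the formalism, and source nodes are represented by integer ids.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QState:
    nonterminal: str    # A in U, e.g. "NP"
    sense: frozenset    # alpha in 2^V1: the set of aligned source-tree nodes

@dataclass(frozen=True)
class BinaryRule:       # <A,alpha> -> <B,beta> <C,gamma>, as in (2)
    lhs: QState
    left: QState
    right: QState

@dataclass(frozen=True)
class LexicalRule:      # <A,alpha> -> w, as in (3)
    lhs: QState
    word: str

# Example: an NP aligned to source node 3 rewrites as an unaligned DT
# plus an NN that is also aligned to node 3.
rule = BinaryRule(QState("NP", frozenset({3})),
                  QState("DT", frozenset()),
                  QState("NN", frozenset({3})))
```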
2.2 Taming Source Nodes
This simple proposal, however, presents two main
difficulties. First, the number of possible senses for
each target node is exponential in the number of
source nodes. Second, note that the senses are sets
of source tree nodes, not word types or absolute sen-
tence positions as in some other translation models.
Except in the case of identical source trees, source
tree nodes will not recur between training and test.
To overcome the first problem, we want further re-
strictions on the set α in a QG state such as 〈A,α〉. It
should not be an arbitrary set of source nodes. In the
experiments of this paper, we adopt the simplest op-
tion of requiring |α| ≤ 1. Thus each node in the tar-
get tree is aligned to a single node in the source tree,
or to ∅ (the traditional NULL alignment). This allows
one-to-many but not many-to-one alignments.
To allow many-to-many alignments, one could
limit |α| to at most 2 or 3 source nodes, perhaps fur-
ther requiring the 2 or 3 source nodes to fall in a par-
ticular configuration within the source tree, such as
child-parent or child-parent-grandparent. With that
configurational requirement, the number of possi-
ble senses α remains small—at most three times the
number of source nodes.
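A small sketch of such a restricted sense inventory, assuming |α| ≤ 2 with a child-parent configurational requirement and a parent map over source-node ids (the interface is illustrative, not part of the formalism):

```python
# `parent` maps each source node id to its parent id (None at the root).

def allowed_senses(source_nodes, parent):
    senses = [frozenset()]                            # the NULL alignment
    senses += [frozenset({v}) for v in source_nodes]  # single nodes
    senses += [frozenset({v, parent[v]})              # child-parent pairs
               for v in source_nodes if parent[v] is not None]
    return senses

# With n source nodes this yields at most 2n + 1 senses rather than 2^n.
```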
We must also deal with the menagerie of differ-
ent source tree nodes in different sentences. In other
words, how can we tie the parameters of the different
QGs that are used to generate translations of differ-
ent source sentences? The answer is that the proba-
bility or weight of a rule such as (2) should depend
on the specific nodes in α, β, and γ only through
their properties—e.g., their nonterminal labels, their
head words, and their grammatical relationship in
the source tree. Such properties do recur between
training and test.
For example, suppose for simplicity that |α| =
|β| = |γ| = 1. Then the rewrite probabilities of (2)
and (3) could be log-linearly modeled using features
that ask whether the single node in α has two chil-
dren in the source tree; whether its children in the
source are the nodes in β and γ; whether its non-
terminal label in the source is A; whether its fringe
in the source translates as w; and so on. The model
should also consider monolingual features of (2) and
(3), evaluating in particular whether A → BC is
likely in the target language.
Whether rule weights are given by factored gener-
ative models or by naive Bayes or log-linear models,
we want to score QG productions with a small set of
monolingual and bilingual features.
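As an illustration, a log-linear scorer over indicator features of this kind might look as follows; the feature names and source-tree accessors are our assumptions, and a real model would normalize over competing rules.

```python
import math

# a, b, c are the single source nodes in alpha, beta, gamma; `src` is an
# assumed source-tree interface with children() and label() accessors.

def rule_features(A, B, C, a, b, c, src):
    return {
        ("src_has_two_children", len(src.children(a)) == 2),
        ("src_children_are_beta_gamma", set(src.children(a)) == {b, c}),
        ("src_label_matches", src.label(a) == A),
        ("mono_rule", (A, B, C)),   # monolingual preference for A -> B C
    }

def rule_score(features, weights):
    # Unnormalized score exp(w . f) over indicator features.
    return math.exp(sum(weights.get(f, 0.0) for f in features))
```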
2.3 Synchronous Grammars Again
Finally, note that synchronous grammar is a special
case of quasi-synchronous grammar. In the context-
free case, a synchronous grammar restricts senses to
single nodes in the source tree and the NULL node.
Further, for any k-ary production
〈X0,α0〉 → 〈X1,α1〉...〈Xk,αk〉
a synchronous context-free grammar requires that
1. (∀i ≠ j) αi ≠ αj unless αi = NULL,
2. (∀i > 0) αi is a child of α0 in the source tree,
unless αi = NULL.
Since NULL has no children in the source tree, these
rules imply that the children of any node aligned to
NULL are themselves aligned to NULL. The con-
struction for synchronous tree-substitution and tree-
adjoining grammars goes through similarly but op-
erates on the derivation trees.
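These two conditions are easy to check procedurally. In the sketch below, senses are single source-node ids or None for NULL, and parent maps each node id to its parent; the checker is illustrative, not an implementation from the paper.

```python
def is_synchronous_production(alpha0, child_alphas, parent):
    aligned = [a for a in child_alphas if a is not None]
    # 1. No two children may share a (non-NULL) source node.
    if len(aligned) != len(set(aligned)):
        return False
    # 2. Each aligned child's node must be a child of alpha0 in T1.
    return all(parent[a] == alpha0 for a in aligned)
```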
3 Parameterizing a QCFG
Recall that our goal is a conditional model of
p(T2,A | T1). For the remainder of this paper, we
adopt a dependency-tree representation of T1 and
T2. Each tree node represents a word of the sentence
together with a part-of-speech tag. Syntactic depen-
dencies in each tree are represented directly by the
parent-child relationships.
Why this representation? First, it helps us con-
cisely formulate a QG translation model where the
source dependencies influence the generation of tar-
get dependencies (see figure 3). Second, for evalu-
ation, it is trivial to obtain the word-to-word align-
ments from the node-to-node alignments. Third, the
part-of-speech tags are useful backoff features, and
in fact play a special role in our model below.
When stochastically generating a translation T2,
our quasi-synchronous generative process will be in-
fluenced by both fluency and adequacy. That is, it
considers both the local well-formedness of T2 (a
monolingual criterion) and T2’s local faithfulness
to T1 (a bilingual criterion). We combine these in
a simple generative model rather than a log-linear
model. When generating the children of a node in
T2, the process first generates their tags using mono-
lingual parameters (fluency), and then fills in the
words using bilingual parameters (adequacy) that se-
lect and translate words from T1.⁵
Concretely, each node in T2 is labeled by a triple
(tag, word, aligned word). Given a parent node
(p, h, h′) in T2, we wish to generate sequences of
left and right child nodes, of the form (c, a, a′).
Our monolingual parameters come from a simple
generative model of syntax used for grammar induc-
tion: the Dependency Model with Valence (DMV) of
Klein and Manning (2004). In scoring dependency
attachments, DMV uses tags rather than words. The
parameters of the model are:
1. pchoose(c | p,dir): the probability of generat-
ing c as the next child tag in the sequence of
dir children, where dir ∈ {left,right}.
2. pstop(s | h,dir,adj): the probability of gener-
ating no more child tags in the sequence of dir
children. This is conditioned in part on the “ad-
jacency” adj ∈ {true,false}, which indicates
whether the sequence of dir children is empty
so far.
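For intuition, a minimal sketch of how DMV generates one direction's child-tag sequence from these two parameter tables (the dict-based interface is an assumption, not Klein and Manning's code):

```python
import random

def generate_children(p, direction, p_stop, p_choose, tags):
    children, adjacent = [], True
    while random.random() >= p_stop[(p, direction, adjacent)]:
        # Draw the next child tag c from pchoose(c | p, dir).
        weights = [p_choose[(c, p, direction)] for c in tags]
        children.append(random.choices(tags, weights=weights)[0])
        adjacent = False    # the dir child sequence is no longer empty
    return children
```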
Our bilingual parameters score word-to-word
translation and aligned dependency configurations.
We thus use the conditional probability ptrans(a |
a′) that source word a′, which may be NULL, trans-
lates as target word a. Finally, when a parent word
h aligned to h′ generates a child, we stochastically
decide to align the child to a node a′ in T1 with
one of several possible relations to h′. A “monotonic”
dependency alignment, for example, would have
h′ and a′ in a parent-child relationship like their
target-tree analogues. In different versions of the
model, we allowed various dependency alignment
configurations (figure 3). These configurations rep-
⁵This division of labor is somewhat artificial, and could be
remedied in a log-linear model, Naive Bayes model, or defi-
cient generative model that generates both tags and words con-
ditioned on both monolingual and bilingual context.
resent cases where the parent-child dependency be-
ing generated by the QG in the target language maps
onto source-language child-parent, for head swap-
ping; the same source node, for two-to-one align-
ment; nodes that are siblings or in a c-command re-
lationship, for scrambling and extraposition; or in
a grandparent-grandchild relationship, e.g. when a
preposition is inserted in the source language. We
also allowed a “none-of-the-above” configuration, to
account for extremely mismatched sentences.
The probability of the target-language depen-
dency treelet rooted at h is thus:
P(D(h) | h, h′, p) =
    ∏_{dir ∈ {l,r}} [ ∏_{c ∈ deps_D(p,dir)} P(D(c) | a, a′, c)
        × pstop(¬stop | p, dir, adj) × pchoose(c | p, dir)
        × pconfig(config) × ptrans(a | a′) ]
    × pstop(stop | p, dir, adj)
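Written procedurally, the recursion looks roughly as follows; the node attributes and parameter tables are our illustrative rendering of the model, not released code.

```python
# A node is assumed to carry .tag, .word, .aligned (its T1 node or None),
# and .deps(direction) returning its left or right child list.

def treelet_prob(h, p_stop, p_choose, p_config, p_trans, config_of):
    prob = 1.0
    for direction in ("l", "r"):
        adjacent = True
        for c in h.deps(direction):
            prob *= p_stop[(h.tag, direction, adjacent)]["nostop"]
            prob *= p_choose[(c.tag, h.tag, direction)]
            # pconfig scores the source-tree relation between the parent's
            # and child's aligned nodes (the figure 3 configurations).
            prob *= p_config[config_of(h.aligned, c.aligned)]
            prob *= p_trans[(c.word, c.aligned)]    # ptrans(a | a')
            prob *= treelet_prob(c, p_stop, p_choose, p_config, p_trans,
                                 config_of)
            adjacent = False
        prob *= p_stop[(h.tag, direction, adjacent)]["stop"]
    return prob
```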
4 Experiments
We claim that for modeling human-translated bitext,
it is better to project syntax only loosely. To evaluate
this claim, we train quasi-synchronous dependency
grammars that allow progressively more divergence
from monotonic tree alignment. We evaluate these
models on cross-entropy over held-out data and on
error rate in a word-alignment task.
One might doubt the use of dependency trees
for alignment, since Gildea (2004) found that con-
stituency trees aligned better. That experiment, how-
ever, aligned only the 1-best parse trees. We too will
consider only the 1-best source tree T1, but in con-
trast to Gildea, we will search for the target tree T2
that aligns best with T1. Finding T2 and the align-
ment is simply a matter of parsing S2 with the QG
derived from T1.
4.1 Data and Training
We performed our modeling experiments with the
German-English portion of the Europarl European
Parliament transcripts (Koehn, 2002). We obtained
monolingual parse trees from the Stanford German
and English parsers (Klein and Manning, 2003).
Initial estimates of lexical translation probabilities
came from the IBM Model 4 translation tables pro-
duced by GIZA++ (Brown et al., 1993; Och and
Ney, 2003).
All text was lowercased and numbers of two or
more digits were converted to an equal number of
hash signs. The bitext was divided into training
sets of 1K, 10K, and 100K sentence pairs. We held
out one thousand sentences for evaluating the cross-
entropy of the various models and hand-aligned
100 sentence pairs to evaluate alignment error rate
(AER).
We trained the model parameters on bitext using
the Expectation-Maximization (EM) algorithm. The
T1 tree is fully observed, but we parse the target lan-
guage. As noted, the initial lexical translation proba-
bilities came from IBM Model 4. We initialized the
monolingual DMV parameters in one of two ways:
using either simple tag co-occurrences as in (Klein
and Manning, 2004) or “supervised” counts from the
monolingual target-language parser. This latter ini-
tialization simulates the condition when one has a
small amount of bitext but a larger amount of tar-
get data for language modeling. As with any mono-
lingual grammar, we perform EM training with the
Inside-Outside algorithm, computing inside prob-
abilities with dynamic programming and outside
probabilities through backpropagation.
Searching the full space of target-language depen-
dency trees and alignments to the source tree con-
sumed several seconds per sentence. During train-
ing, therefore, we constrained alignments to come
from the union of GIZA++ Model 4 alignments.
These constraints were applied only during training
and not during evaluation of cross-entropy or AER.
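A sketch of how this constraint can be precomputed, assuming the two directional GIZA++ alignments are available as sets of (target index, source index) pairs; this is our rendering, not the paper's code.

```python
def allowed_links(giza_e2f, giza_f2e, n_target):
    union = giza_e2f | giza_f2e
    # Each target word may align to any source word linked in the union,
    # or to NULL (None).
    return {j: {i for (jj, i) in union if jj == j} | {None}
            for j in range(n_target)}
```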
4.2 Conditional Cross-Entropy of the Model
To test the explanatory power of our QCFG, we eval-
uated its conditional cross-entropy on held-out data
(table 1). In other words, we measured how well a
trained QCFG could predict the true translation of
novel source sentences by summing over all parses
of the target given the source. We trained QCFG
models under different conditions of bitext size and
parameter initialization. However, the principal in-
dependent variable was the set of dependency align-
ment configurations allowed.
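Concretely, conditional cross-entropy can be computed along the following lines, assuming a routine inside_prob(t1, s2) that returns p(S2 | T1) by summing over all target parses and alignments with the inside algorithm; normalizing per target word is our assumption about the units.

```python
import math

def conditional_cross_entropy(held_out_pairs, inside_prob):
    bits, words = 0.0, 0
    for t1, s2 in held_out_pairs:    # s2 is the target word sequence
        bits += -math.log2(inside_prob(t1, s2))
        words += len(s2)
    return bits / words
```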
From these cross-entropy results, it is clear that
strictly synchronous grammar is unwise. We ob-
[Figure 3 graphic: six small tree-pair fragments illustrating the alignment configurations (a) parent-child, (b) child-parent, (c) same node, (d) siblings, (e) grandparent-grandchild, and (f) c-command.]
Figure 3: When a head h aligned to h′ generates a new child a aligned to a′ under the QCFG, h′ and a′ may be related in the
source tree as, among other things, (a) parent-child, (b) child-parent, (c) identical nodes, (d) siblings, (e) grandparent-grandchild,
(f) c-commander–c-commandee, (g) none of the above. Here German is the source and English is the target. Case (g), not pictured
above, can be seen in figure 1, in English-German order, where the child-parent pair Tschernobyl könnte corresponds to the words
Chernobyl and could, respectively. Since could dominates Chernobyl, they are not in a c-command relationship.
Permitted configurations        CE at 1k   CE at 10k   CE at 100k
∅ or parent-child (a)             43.82       22.40        13.44
+ child-parent (b)                41.27       21.73        12.62
+ same node (c)                   41.01       21.50        12.38
+ all breakages (g)               35.63       18.72        11.27
+ siblings (d)                    34.59       18.59        11.21
+ grandparent-grandchild (e)      34.52       18.55        11.17
+ c-command (f)                   34.46       18.59        11.27
No alignments allowed             60.86       53.28        46.94
Table 1: Cross-entropy on held-out data with different depen-
dency configurations (figure 3) allowed, for 1k, 10k, and 100k
training sentences. The big error reductions arrive when we
allow arbitrary non-local alignments in condition (g). Distin-
guishing some common cases of non-local alignments improves
performance further. For comparison, we show cross-entropy
when every target language node is unaligned.
tain comparatively poor performance if we require
parent-child pairs in the target tree to align to parent-
child pairs in the source (or to parent-NULL or
NULL-NULL). Performance improves as we allow
and distinguish more alignment configurations.
4.3 Word Alignment
We computed standard measures of alignment preci-
sion, recall, and error rate on a test set of 100 hand-
aligned German sentence pairs with 1300 alignment
links. As with many word-alignment evaluations,
we do not score links to NULL. Just as for cross-
entropy, we see that more permissive alignments
lead to better performance (table 2).
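For reference, the standard metrics can be sketched as follows; sure and possible are gold link sets in the sense of Och and Ney (2003), and if the hand alignments draw no such distinction the same set serves as both. NULL links are excluded beforehand, as in the evaluation above.

```python
def alignment_metrics(hyp, sure, possible):
    # hyp, sure, possible are sets of (target index, source index) links,
    # with sure a subset of possible.
    precision = len(hyp & possible) / len(hyp)
    recall = len(hyp & sure) / len(sure)
    aer = 1 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
    return precision, recall, aer
```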
Having selected the best system using the cross-
entropy measurement, we compare its alignment er-
ror rate against the standard GIZA++ Model 4 base-
lines. As Figure 4 shows, our QCFG for German →
English consistently produces better alignments than
the Model 4 channel model for the same direction,
German → English. This comparison is the appro-
priate one because both of these models are forced
to align each English word to at most one German
word.⁶
5 Conclusions
With quasi-synchronous grammars, we have pre-
sented a new approach to syntactic MT: construct-
ing a monolingual target-language grammar that de-
scribes the aligned translations of a source-language
sentence. We described a simple parameterization
⁶For German → English MT, one would use a German →
English QCFG as above, but an English → German channel
model. In this arguably inappropriate comparison, Figure 4
shows, the Model 4 channel model produces slightly better
word alignments than the QG.
Permitted configurations        AER at 1k   AER at 10k   AER at 100k
∅ or parent-child (a)             40.69        39.03         33.62
+ child-parent (b)                43.17        39.78         33.79
+ same node (c)                   43.22        40.86         34.38
+ all breakages (g)               37.63        30.51         25.99
+ siblings (d)                    37.87        33.36         29.27
+ grandparent-grandchild (e)      36.78        32.73         28.84
+ c-command (f)                   37.04        33.51         27.45
Table 2: Alignment error rate (%) with different dependency
configurations allowed.
[Figure 4 graphic: alignment error rate (y-axis, 0.2 to 0.5) versus training sentence pairs (x-axis, 1000 to 1e+06, log scale), with curves for QCFG, Giza4, and Giza4 bk.]
Figure 4: Alignment error rate with best model (all break-
ages). The QCFG consistently beat one GIZA++ model and
was close to the other.
with gradually increasing syntactic domains of lo-
cality, and estimated those parameters on German-
English bitext.
The QG formalism admits many more nuanced
options for features than we have exploited. In par-
ticular, we now are exploring log-linear QGs that
score overlapping elementary trees of T2 while con-
sidering the syntactic configuration and lexical con-
tent of the T1 nodes to which each elementary tree
aligns.
Even simple QGs, however, turned out to do quite
well. Our evaluation on a German-English word-
alignment task showed them to be competitive with
IBM model 4—consistently beating the German-
English direction by several percentage points of
alignment error rate and within 1% AER of the
English-German direction. In particular, alignment
accuracy benefited from allowing syntactic break-
ages between the two dependency structures.
We are also working on translation decoding us-
ing QGs. Our first system uses the QG to find the
optimal T2 aligned to T1 and then extracts a synchronous
tree-substitution grammar from the aligned trees.
Our second system searches a target-language vo-
cabulary for the optimal T2 given the input T1.
Acknowledgements
This work was supported by a National Science
Foundation Graduate Research Fellowship for the
first author and by NSF Grant No. 0313193.

References
H. Alshawi, S. Bangalore, and S. Douglas. 2000. Learning
dependency translation models as collections of finite state
head transducers. CL, 26(1):45–60.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L.
Mercer. 1993. The mathematics of statistical machine trans-
lation: Parameter estimation. CL, 19(2):263–311.
Y. Ding and M. Palmer. 2005. Machine translation using prob-
abilistic synchronous dependency insertion grammars. In
ACL, pages 541–548.
B. J. Dorr. 1994. Machine translation divergences: A formal
description and proposed solution. Computational Linguis-
tics, 20(4):597–633.
J. Eisner. 2003. Learning non-isomorphic tree mappings for
machine translation. In ACL Companion Vol.
H. J. Fox. 2002. Phrasal cohesion and statistical machine trans-
lation. In EMNLP, pages 392–399.
D. Gildea. 2003. Loosely tree-based alignment for machine
translation. In ACL, pages 80–87.
D. Gildea. 2004. Dependencies vs. constituents for tree-based
alignment. In EMNLP, pages 214–221.
R. Hwa, P. Resnik, A. Weinberg, and O. Kolak. 2002. Evalu-
ating translational correspondence using annotation projec-
tion. In ACL.
D. Klein and C. D. Manning. 2003. Accurate unlexicalized
parsing. In ACL, pages 423–430.
D. Klein and C. D. Manning. 2004. Corpus-based induction of
syntactic structure: Models of dependency and constituency.
In ACL, pages 479–486.
P. Koehn. 2002. Europarl: A multilingual
corpus for evaluation of machine translation.
http://www.iccs.informatics.ed.ac.uk/~pkoehn/publications/europarl.ps.
A. Lavie, S. Vogel, L. Levin, E. Peterson, K. Probst, A. F.
Llitjós, R. Reynolds, J. Carbonell, and R. Cohen. 2003. Ex-
periments with a Hindi-to-English transfer-based MT system
under a miserly data scenario. ACM Transactions on Asian
Language Information Processing, 2(2):143–163.
I. D. Melamed, G. Satta, and B. Wellington. 2004. Generalized
multitext grammars. In ACL, pages 661–668.
I. D. Melamed. 2004. Statistical machine translation by pars-
ing. In ACL, pages 653–660.
F. J. Och and H. Ney. 2003. A systematic comparison of various
statistical alignment models. CL, 29(1):19–51.
C. Quirk, A. Menezes, and C. Cherry. 2005. Dependency
treelet translation: Syntactically informed phrasal SMT. In
ACL, pages 271–279.
S. M. Shieber and Y. Schabes. 1990. Synchronous tree-
adjoining grammars. In ACL, pages 253–258.
D. Wu. 1997. Stochastic inversion transduction grammars and
bilingual parsing of parallel corpora. CL, 23(3):377–403.
K. Yamada and K. Knight. 2001. A syntax-based statistical
translation model. In ACL.
