Inducing Word Alignments with Bilexical Synchronous Trees
Hao Zhang and Daniel Gildea
Computer Science Department
University of Rochester
Rochester, NY 14627
Abstract
This paper compares different bilexical
tree-based models for bilingual alignment.
EM training for the new model benefits
from the dynamic programming "hook
trick". The model produces improved
dependency structure for both languages.
1 Introduction
A major difficulty in statistical machine translation
is the trade-off between representational power
and computational complexity. Real-world cor-
pora for language pairs such as Chinese-English
have complex reordering relationships that are not
captured by current phrase-based MT systems, de-
spite their state-of-the-art performance measured
in competitive evaluations. Synchronous gram-
mar formalisms that are capable of modeling
such complex relationships while maintaining the
context-free property in each language have been
proposed for many years (Aho and Ullman, 1972;
Wu, 1997; Yamada and Knight, 2001; Melamed,
2003; Chiang, 2005), but have not been scaled to
large corpora and long sentences until recently.
In Synchronous Context Free Grammars, there
are two sources of complexity, grammar branch-
ing factor and lexicalization. In this paper we fo-
cus on the second issue, constraining the gram-
mar to the binary-branching Inversion Transduc-
tion Grammar of Wu (1997). Lexicalization seems
likely to help models predict alignment patterns
between languages, and has been proposed by
Melamed (2003) and implemented by Alshawi et
al. (2000) and Zhang and Gildea (2005). However,
each piece of lexical information considered by a
model multiplies the number of states of dynamic
programming algorithms for inference, meaning
that we must choose how to lexicalize very care-
fully to control complexity.
In this paper we compare two approaches to
lexicalization, both of which incorporate bilexical
probabilities. One model uses bilexical probabil-
ities across languages, while the other uses bilex-
ical probabilities within one language. We com-
pare results on word-level alignment, and investi-
gate the implications of the choice of lexicaliza-
tion on the specifics of our alignment algorithms.
The new model, which bilexicalizes within lan-
guages, allows us to use the "hook trick" (Eis-
ner and Satta, 1999) and therefore reduces com-
plexity. We describe the application of the hook
trick to estimation with Expectation Maximization
(EM). Despite the theoretical benefits of the hook
trick, it is not widely used in statistical monolin-
gual parsers, because the savings do not exceed
those obtained with simple pruning. We speculate
that the advantages may be greater in an EM set-
ting, where parameters to guide pruning are not
(initially) available.
In order to better understand the model, we an-
alyze its performance in terms of both agreement
with human-annotated alignments, and agreement
with the dependencies produced by monolingual
parsers. We find that within-language bilexical-
ization does not improve alignment over cross-
language bilexicalization, but does improve recov-
ery of dependencies. We find that the hook trick
significantly speeds training, even in the presence
of pruning.
Section 2 describes the generative model. The
hook trick for EM is explained in Section 3. In
Section 4, we evaluate the model in terms of align-
ment error rate and dependency error rate. We
conclude with discussions in Section 5.
2 Bilexicalization of Inversion
Transduction Grammar
The Inversion Transduction Grammar of Wu
(1997) models word alignment between a transla-
tion pair of sentences by assuming a binary syn-
chronous tree on top of both sides. Using EM
training, ITG can induce good alignments by exploring
the hidden synchronous trees over instances of string pairs.
ITG consists of unary production rules that gen-
erate English/foreign word pairs e/f:
X → e/f
and binary production rules in two forms that gen-
erate subtree pairs, written:
X → [Y Z]
and
X → 〈Y Z〉
The square brackets indicate that the right hand side
rewriting order is the same for both languages.
The pointed brackets indicate a syntactic reordering:
the two right hand side constituents rewrite in the
opposite order in the second language.
The unary rules account for the alignment links
across two sides. Either e or f may be a special
null word, handling insertions and deletions. The
two kinds of binary rules (called straight rules and
inverted rules) build up a coherent tree structure
on top of the alignment links. From a modeling
perspective, the synchronous tree that may involve
inversions tells a generative story behind the word
level alignment.
An example ITG tree for the sentence pair Je
les vois / I see them is shown in Figure 1(left). The
probability of the tree is the product of the rule
probabilities at each node:
P(S → A)
· P(A → [C B])
· P(C → I/Je)
· P(B → 〈C C〉)
· P(C → see/vois)
· P(C → them/les)
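As an illustration, the following Python sketch (illustrative only, not part of the system implementation) scores this derivation, assuming a hypothetical table p that maps each rule to its probability:

import math

# A minimal sketch: the probability of an ITG derivation is the product of
# the probabilities of the rules it applies.  The table p is hypothetical.
def tree_prob(rules, p):
    return math.prod(p[r] for r in rules)

example_rules = [
    ("S", "A"),          # S -> A
    ("A", "[C B]"),      # straight binary rule
    ("C", "I/Je"),       # unary word-pair rules
    ("B", "<C C>"),      # inverted binary rule
    ("C", "see/vois"),
    ("C", "them/les"),
]
# e.g. prob = tree_prob(example_rules, p)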
The structural constraint of ITG, which is that
only binary permutations are allowed on each
level, has been demonstrated to be reasonable
by Zens and Ney (2003) and Zhang and Gildea
(2004). However, in the space of ITG-constrained
synchronous trees, we still have choices in making
the probabilistic distribution over the trees more
realistic. The original Stochastic ITG is the coun-
terpart of Stochastic CFG in the bitext space. The
probability of an ITG parse tree is simply a prod-
uct of the probabilities of the applied rules. Thus,
it only captures the fundamental features of word
links and reflects how often inversions occur.
2.1 Cross-Language Bilexicalization
Zhang and Gildea (2005) described a model in
which the nonterminals are lexicalized by English
and foreign language word pairs so that the inver-
sions are dependent on lexical information on the
left hand side of synchronous rules. By introduc-
ing the mechanism of probabilistic head selection
there are four forms of probabilistic binary rules
in the model, which are the four possibilities cre-
ated by taking the cross-product of two orienta-
tions (straight and inverted) and two head choices:
X(e/f) → [Y(e/f) Z]
X(e/f) → [Y Z(e/f)]
X(e/f) → 〈Y(e/f) Z〉
X(e/f) → 〈Y Z(e/f)〉
where (e/f) is a translation pair.
A tree for our example sentence under this
model is shown in Figure 1(center). The tree’s
probability is again the product of rule probabil-
ities:
P(S → A(see/vois))
· P(A(see/vois) → [C B(see/vois)])
· P(C → C(I/Je))
· P(B(see/vois) → 〈C(see/vois) C〉)
· P(C → C(them/les))
2.2 Head-Modifier Bilexicalization
One disadvantage of the model above is that it
is not capable of modeling bilexical dependen-
cies on the right hand side of the rules. Thus,
while the probability of a production being straight
or inverted depends on a bilingual word pair, it
does not take head-modifier relations in either lan-
guage into account. However, modeling complete
bilingual bilexical dependencies as theorized in
Melamed (2003) implies a huge parameter space
of O(|V|^2 |T|^2), where |V| and |T| are the vo-
cabulary sizes of the two languages. So, in-
stead of modeling cross-language word transla-
tions and within-language word dependencies in
Figure 1: Parses for an example sentence pair under unlexicalized ITG (left), cross-language bilexicalization (center), and head-modifier bilexicalization (right). Thick lines indicate the head child; a crossbar indicates an inverted production.
a joint fashion, we factor them apart. We lexical-
ize the dependencies in the synchronous tree using
words from only one language and translate the
words into their counterparts in the other language
only at the bottom of the tree. Formally, we have
the following patterns of binary dependency rules:
X(e) → [Y(e) Z(e′)]
X(e) → [Y(e′) Z(e)]
X(e) → 〈Y(e) Z(e′)〉
X(e) → 〈Y(e′) Z(e)〉
where e is an English head and e′ is an English
modifier.
Equally importantly, we have the unary lexical
rules that generate foreign words:
X(e) → e/f
To make the generative story complete, we also
have a top rule that goes from the unlexicalized
start symbol to the highest lexicalized nonterminal
in the tree:
S → X(e)
Figure 1(right) shows our example sentence’s
tree under the new model. The probability of a
bilexical synchronous tree between the two sen-
tences is:
P(S → A(see))
· P(A(see) → [C(I) B(see)])
· P(C(I) → I/Je)
· P(B(see) → 〈C(see) C(them)〉)
· P(C(see) → see/vois)
· P(C(them) → them/les)
Interestingly, the lexicalized B(see) predicts
not only the existence of C(them), but also that
there is an inversion involved going from C(see)
to C(them). This reflects the fact that direct ob-
ject pronouns come after the verb in English, but
before the verb in French. Thus, despite condi-
tioning on information about words from only one
language, the model captures syntactic reordering
information about the specific language pair it is
trained on. We are able to discriminate between
the straight and inverted binary nodes in our ex-
ample tree in a way that cross-language bilexical-
ization could not.
In terms of inference within the framework,
we do the usual Viterbi inference to find the best
bilexical synchronous tree and treat the depen-
dencies and the alignment given by the Viterbi
parse as the best ones, though mathematically the
best alignment should have the highest probabil-
ity marginalized over all dependencies constrained
by the alignment. We do unsupervised training to
obtain the parameters using EM. Both EM and
Viterbi inference can be done using the dynamic
programming framework of synchronous parsing.
3 Inside-Outside Parsing with the Hook
Trick
The ITG parsing algorithm is a CYK-style chart pars-
ing algorithm extended to bitext. Instead of build-
ing up constituents over spans on a string, an ITG
chart parser builds up constituents over subcells
within a cell defined by two strings. We use
β(X(e), s, t, u, v) to denote the inside probabil-
ity of X(e) which is over the cell of (s, t, u, v)
where (s, t) are indices into the source language
string and (u, v) are indices into the target lan-
guage string. We use α(X(e), s, t, u, v) to de-
note its outside probability. Figure 2 shows how
smaller cells adjacent along diagonals can be com-
bined to create a large cell. We number the sub-
cells counterclockwise. To analyze the complex-
ity of the algorithm with respect to input string
Figure 2: Left: Chart parsing over the bitext cell of (s, t, u, v); split points S and U divide the cell into subcells numbered 1–4 counterclockwise. Right: One of the four hooks built for the four corners for more efficient parsing.
length, without loss of generality, we ignore the
nonterminal symbols X, Y , and Z to simplify the
derivation.
The inside algorithm in the context of bilexical
ITG is based on the following dynamic program-
ming equation:
β(e, s, t, u, v)
= Σ_{S,U,e′} [ β1(e) · β3(e′) · P([e′ e] | e)
             + β2(e) · β4(e′) · P(〈e e′〉 | e)
             + β3(e) · β1(e′) · P([e e′] | e)
             + β4(e) · β2(e′) · P(〈e′ e〉 | e) ]
So, on the right hand side, we sum over all possible
ways (S, U) of splitting the left hand side cell
and all possible head words (e′) for the non-head
subcell. All eight variables e, e′, s, t, u, v, S, and U
take O(n) values, given that the lengths of
the source string and the target string are O(n).
Thus the entire DP algorithm takes O(n^8) steps.
Fortunately, we can reduce the maximum num-
ber of interacting variables by factorizing the ex-
pression.
Let us keep the results of the summations over
e′ as:
β+1(e) = Σ_{e′} β1(e′) · P([e e′] | e)
β+2(e) = Σ_{e′} β2(e′) · P(〈e′ e〉 | e)
β+3(e) = Σ_{e′} β3(e′) · P([e′ e] | e)
β+4(e) = Σ_{e′} β4(e′) · P(〈e e′〉 | e)
The computation of each β+ involves four
boundary indices and two head words. So, we can
rely on DP to compute them in O(n^6). Based on
these intermediate results, we have the equivalent
DP expression for computing inside probabilities:
β(e, s, t, u, v)
= Σ_{S,U} [ β1(e) · β+3(e)
          + β2(e) · β+4(e)
          + β3(e) · β+1(e)
          + β4(e) · β+2(e) ]
We reduced one variable from the original ex-
pression. The maximum number of interacting
variables throughout the algorithm is 7. So the im-
proved inside algorithm has a time complexity of
O(n^7).
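As an illustration of this factorization, the following Python sketch (illustrative only, not our implementation) computes the hook tables and the contribution of a single split point (S, U). The container names (beta, rule, hooks) and the particular assignment of subcell coordinates to the counterclockwise numbering of Figure 2 are our own assumptions.

from collections import defaultdict

# beta[(cell, e)]: inside probability of a constituent headed by e over the
# bitext cell (s, t, u, v); rule[form][e][e2]: bilexical rule probability,
# with form in {"[h m]", "[m h]", "<h m>", "<m h>"} (h = head, m = modifier).

def hook_table(beta, rule, form, subcell, words):
    # beta_plus(e) for one subcell and rule form: the modifier head word e2 is
    # marginalized out once per subcell, so it never interacts with the parent
    # cell's remaining boundary indices (O(n^6) over the whole chart).
    return {e: sum(beta.get((subcell, e2), 0.0) * rule[form][e][e2]
                   for e2 in words)
            for e in words}

def split_contribution(beta, hooks, words, s, t, u, v, S, U):
    # hooks[(form, subcell)] is the table returned by hook_table; subcells are
    # numbered counterclockwise, e.g. 1 = (S,t,U,v), 2 = (s,S,U,v),
    # 3 = (s,S,u,U), 4 = (S,t,u,U).
    c1, c2, c3, c4 = (S, t, U, v), (s, S, U, v), (s, S, u, U), (S, t, u, U)
    out = defaultdict(float)
    for e in words:
        out[e] += beta.get((c1, e), 0.0) * hooks["[m h]", c3][e]  # head in cell 1
        out[e] += beta.get((c2, e), 0.0) * hooks["<h m>", c4][e]  # head in cell 2
        out[e] += beta.get((c3, e), 0.0) * hooks["[h m]", c1][e]  # head in cell 3
        out[e] += beta.get((c4, e), 0.0) * hooks["<m h>", c2][e]  # head in cell 4
    return out  # accumulated into beta[((s, t, u, v), e)]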
The trick of reducing interacting variables in DP
for bilexical parsing has been pointed out by Eis-
ner and Satta (1999). Melamed (2003) discussed
the applicability of the so-called hook trick for
parsing bilexical multitext grammars. The name
hook is based on the observation that we combine
the non-head constituent with the bilexical rule to
create a special constituent that matches the head
like a hook as demonstrated in Figure 2. How-
ever, for EM, it is not clear from their discussions
how we can do the hook trick in the outside pass.
The bilexical rules in all four directions are anal-
ogous. To simplify the derivation for the outside
algorithm, we just focus on the first case: straight
rule with right head word.
The outside probability of the constituent
(e, S, t, U, v) in cell 1 being a head of such rules
is:
Σ_{s,u,e′} ( α(e) · β3(e′) · P([e′ e] | e) )
= Σ_{s,u} ( α(e) · Σ_{e′} β3(e′) · P([e′ e] | e) )
= Σ_{s,u} ( α(e) · β+3(e) )
which indicates we can reuse β+ of the lower left
neighbors of the head to make the computation
feasible in O(n^7).
On the other hand, the outside probability for
(e′, s, S, u, U) in cell 3 acting as a modifier of such
a rule is:
Σ_{t,v,e} ( α(e) · β1(e) · P([e′ e] | e) )
= Σ_{e} ( P([e′ e] | e) · Σ_{t,v} α(e) · β1(e) )
= Σ_{e} ( P([e′ e] | e) · α+3(e) )
in which we memorize another kind of intermedi-
ate sum to make the computation no more complex
than O(n^7).
We can think of α+3 as the outside probability
of the hook on cell 3 which matches cell 1. Gener-
ally, we need outside probabilities for hooks in all
four directions.
α+1(e) = Σ_{s,u} α(e) · β3(e)
α+2(e) = Σ_{t,u} α(e) · β4(e)
α+3(e) = Σ_{t,v} α(e) · β1(e)
α+4(e) = Σ_{s,v} α(e) · β2(e)
Based on them, we can add up the outside prob-
abilities of a constituent acting as one of the two
children of each applicable rule on top of it to get
the total outside probability.
We finalize the derivation by simplifying the expression
for the expected count of (e → [e′ e]):
EC(e → [e′ e])
= Σ_{s,t,u,v,S,U} ( P([e′ e] | e) · β3(e′) · α(e) · β1(e) )
= Σ_{s,S,u,U} ( P([e′ e] | e) · β3(e′) · Σ_{t,v} α(e) · β1(e) )
= Σ_{s,S,u,U} ( P([e′ e] | e) · β3(e′) · α+3(e) )
which can be computed in O(n^6) as long as we
have α+3 ready in a table. Overall we can do the
inside-outside algorithm for the bilexical ITG in
O(n^7), by reducing a factor of n through interme-
diate DP.
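A Python sketch of this expected-count computation, under the same hypothetical containers as the earlier sketch (alpha and beta indexed by (cell, head word)), might look as follows; the names are our own, and the sum over (s, S, u, U) and the division by the sentence-pair probability are left to the caller.

from collections import defaultdict

def alpha_plus_3(alpha, beta, words, s, S, u, U, n_src, n_tgt):
    # Outside hook on cell 3: sum over the unseen boundaries (t, v) of the
    # parent cell, pairing the parent's outside score with the inside score
    # of cell 1 = (S, t, U, v).
    table = defaultdict(float)
    for e in words:
        for t in range(S + 1, n_src + 1):
            for v in range(U + 1, n_tgt + 1):
                table[e] += (alpha.get(((s, t, u, v), e), 0.0)
                             * beta.get(((S, t, U, v), e), 0.0))
    return table

def ec_term(beta, a_plus_3, rule, e, e2, s, S, u, U):
    # One term of EC(e -> [e2 e]): cell 3 = (s, S, u, U) headed by the
    # modifier e2, combined with the outside hook for head e.
    return rule["[m h]"][e][e2] * beta.get(((s, S, u, U), e2), 0.0) * a_plus_3[e]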
The entire trick can be understood clearly
if we imagine that the bilexical rules are unary rules
applied on top of the non-head constituents,
reducing each to a virtual lexicalized constituent
(a hook) covering the same subcell while sharing
the head word with the head constituent. However,
if we build hooks looking for all words in a sen-
tence whenever a complete constituent is added to
the chart, we will build many hooks that are never
used, considering that the words outside of larger
cells are fewer and pruning might further reduce
the possible outside words. Blind guessing of what
might appear outside of the current cell will off-
set the saving we can achieve. Instead of actively
building hooks, which are intermediate results, we
can build them only when we need them and then
cache them for future use. So the construction of
the hooks will be invoked by the heads when the
heads need to combine with adjacent cells.
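A minimal sketch of this lazy, cached construction, reusing the conventions of the earlier sketches (the cache key and interface are our own assumptions):

hook_cache = {}

def get_hook(beta, rule, form, subcell, e, words):
    # Build the hook for (subcell, form, head word e) only on first request,
    # then reuse it for every later head constituent that asks for it.
    key = (form, subcell, e)
    if key not in hook_cache:
        hook_cache[key] = sum(beta.get((subcell, e2), 0.0) * rule[form][e][e2]
                              for e2 in words)
    return hook_cache[key]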
3.1 Pruning and Smoothing
We apply one of the pruning techniques used in
Zhang and Gildea (2005). The technique is gen-
eral enough to be applicable to any parsing algo-
rithm over bitext cells. It is called tic-tac-toe prun-
ing since it involves an estimate of both the inside
probability of the cell (how likely the words within
the box in both dimensions are to align) and the
outside probability (how likely the words outside
the box in both dimensions are to align). By scor-
ing the bitext cells and throwing away the bad cells
that fall out of a beam, it can reduce over 70% of
O(n^4) cells using 10^-5 as the beam ratio for sen-
tences up to 25 words in the experiments, without
harming alignment error rate, at least for the un-
lexicalized ITG.
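A rough Python sketch of this kind of cell-level beam pruning is given below; the figure-of-merit functions inside_est and outside_est are hypothetical stand-ins for the estimates used by Zhang and Gildea (2005), and comparing every cell against the single best score is a simplification of their method.

def prune_cells(cells, inside_est, outside_est, beam=1e-5):
    # Score each bitext cell by the product of its inside and outside
    # estimates and keep only cells within the beam of the best score.
    scores = {c: inside_est(c) * outside_est(c) for c in cells}
    best = max(scores.values()) if scores else 0.0
    return {c for c in cells if scores[c] >= beam * best}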
The hook trick reduces the complexity of the bilexical
ITG from O(n^8) to O(n^7). With tic-tac-toe
pruning reducing the number of bitext cells to
work with, and because the grammar constant
for ITG is very small, the parsing algorithm runs
at an acceptable speed.
The probabilistic model has a large number of
word-pair parameters: there are O(|V|^2) dependency
probabilities and O(|V||T|) translation
probabilities, where |V| is the size of the English
vocabulary and |T| is the size of the foreign-language
vocabulary. The translation probabilities
P(f | X(e)) are backed off to a uniform distribution.
We let the bilexical dependency probabilities
back off to uni-lexical dependencies in the following
forms:
P([Y(∗) Z(e′)] | X(∗))
P([Y(e′) Z(∗)] | X(∗))
P(〈Y(∗) Z(e′)〉 | X(∗))
P(〈Y(e′) Z(∗)〉 | X(∗))
Figure 3: Speedup for EM by the hook trick (average parsing time in seconds vs. sentence length, with and without hooks). (a) is without pruning. In (b), we apply pruning on the bitext cells before parsing begins.
The two levels of distributions are interpolated
using a technique inspired by Witten-Bell smooth-
ing (Chen and Goodman, 1996). We use the ex-
pected count of the left hand side lexical nontermi-
nal to adjust the weight for the EM-trained bilexi-
cal probability. For example,
P([Y(e) Z(e′)] | X(e)) =
(1 − λ) P_EM([Y(e) Z(e′)] | X(e))
+ λ P([Y(∗) Z(e′)] | X(∗))
where
λ = 1 / (1 + ExpectedCount(X(e)))
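A minimal Python sketch of this interpolation, with hypothetical tables p_em (EM-trained bilexical probabilities keyed by rule form, head, and modifier), p_uni (uni-lexical backoff keyed by rule form and modifier), and expected_count (expected counts of the lexicalized nonterminals X(e)):

def smoothed_prob(form, e, e2, p_em, p_uni, expected_count):
    # lam -> 1 when X(e) has a small expected count, shrinking the estimate
    # toward the uni-lexical backoff distribution.
    lam = 1.0 / (1.0 + expected_count.get(e, 0.0))
    return ((1.0 - lam) * p_em.get((form, e, e2), 0.0)
            + lam * p_uni.get((form, e2), 0.0))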
4 Experiments
First of all, we are interested in finding out how
much speedup can be achieved by doing the hook
trick for EM. We implemented both versions in
C++ and turned off pruning for both. We ran the
two inside-outside parsing algorithms on a small
test set of 46 sentence pairs that are no longer than
25 words in both languages. Then we put the re-
sults into buckets of (1–4), (5–9), (10–14),
(15–19), and (20–24) according to the maximum
length of two sentences in each pair and took av-
erages of these timing results. Figure 3 (a) shows
clearly that as the sentences get longer the hook
trick is helping more and more. We also tried to
turn on pruning for both, which is the normal con-
dition for the parsers. Both are much faster due
to the effectiveness of pruning. The speedup ratio
is lower because hooks are reused less often
when many cells are pruned away. Figure 3
(b) shows the speedup curve in this situation.
We trained both the unlexicalized and the lex-
icalized ITGs on a parallel corpus of Chinese-
English newswire text. The Chinese data were
automatically segmented into tokens, and English
capitalization was retained. We replaced words
occurring only once with an unknown word token,
resulting in a Chinese vocabulary of 23,783 words
and an English vocabulary of 27,075 words.
We did two types of comparisons. In the first
comparison, we measured the performance of five
word aligners, including IBM models, ITG, the
lexical ITG (LITG) of Zhang and Gildea (2005),
and our bilexical ITG (BLITG), on a hand-aligned
bilingual corpus. All the models were trained us-
ing the same amount of data. We ran the ex-
periments on sentences up to 25 words long in
both languages. The resulting training corpus had
18,773 sentence pairs with a total of 276,113 Chi-
nese words and 315,415 English words.
For scoring the Viterbi alignments of each sys-
tem against gold-standard annotated alignments,
we use the alignment error rate (AER) of Och
and Ney (2000), which measures agreement at the
level of pairs of words:
AER = 1 − (|A ∩ G_P| + |A ∩ G_S|) / (|A| + |G_S|)
where A is the set of word pairs aligned by the
automatic system, G_S is the set marked in the
gold standard as "sure", and G_P is the set marked
as "possible" (including the "sure" pairs). In our
Chinese-English data, only one type of alignment
was marked, meaning that G_P = G_S.
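For reference, a direct Python transcription of this measure (the sets are assumed to contain (source position, target position) pairs):

def aer(A, sure, possible):
    # possible is a superset of sure; in our data the two sets coincide.
    return 1.0 - (len(A & possible) + len(A & sure)) / (len(A) + len(sure))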
In our hand-aligned data, 47 sentence pairs are
no longer than 25 words in either language and
were used to evaluate the aligners.
A separate development set of hand-aligned
sentence pairs was used to control overfitting. The
subset of up to 25 words in both languages was
used. We chose the number of iterations for EM
Alignment
         Precision  Recall  Error Rate
IBM-1      .56        .42       .52
IBM-4      .67        .43       .47
ITG        .68        .52       .41
LITG       .69        .51       .41
BLITG      .68        .51       .42

Dependency
         Precision  Recall  Error Rate
ITG-lh     .11        .11       .89
ITG-rh     .22        .22       .78
LITG       .13        .12       .88
BLITG      .24        .22       .77

Table 1: Bilingual alignment and English dependency results on the Chinese-English corpus (≤ 25 words on both sides). LITG is the cross-language Lexicalized ITG. BLITG is the within-English Bilexical ITG. ITG-lh is ITG with the left-head assumption on English; ITG-rh is ITG with the right-head assumption.
Alignment
         Precision  Recall  AER
ITG        .59        .60    .41
LITG       .60        .57    .41
BLITG      .58        .55    .44

Dependency
         Precision  Recall  DER
ITG-rh     .23        .23    .77
LITG       .11        .11    .89
BLITG      .24        .24    .76

Table 2: Alignment and dependency results on a larger Chinese-English corpus.
training as the turning point of AER on the de-
velopment data set. The unlexicalized ITG was
trained for 3 iterations. LITG was trained for only
1 iteration, partly because it was initialized with
fully trained ITG parameters. BLITG was trained
for 3 iterations.
For comparison, we also included the results
from IBM Model 1 and Model 4. The numbers
of iterations for training the IBM models were
also chosen as the turning points of AER on the
development data set.
We also want to know whether or not BLITG
can model dependencies better than LITG. For
this purpose, we also used the AER measurement,
since the goal is still getting higher precision/recall
for a set of recovered word links, although the de-
pendency word links are within one language. For
this reason, we rename AER to Dependency Error
Rate (DER). The dependency portion of Table 1
gives the results on the English side of the test
data set. The dependency results on Chinese are similar.
The gold standard dependencies were extracted
from Collins’ parser output on the sentences. The
LITG and BLITG dependencies were extracted
from the Viterbi synchronous trees by following
the head words.
For comparison, we also included two baseline
results. ITG-lh is unlexicalized ITG with left-head
assumption, meaning the head words always come
from the left branches. ITG-rh is ITG with right-
head assumption.
To draw more confident conclusions, we also
did tests on a larger hand-aligned data set used in
Liu et al. (2005). We used 165 sentence pairs that
are up to 25 words in length on both sides.
5 Discussion
The BLITG model has two components, namely
the dependency model on the upper levels of the
tree structure and the word-level translation model
at the bottom. We hope that the two components
will improve one another. The current
experiments indicate clearly that word-level
alignment does help induce dependency structures
on both sides. The precision and recall on
the dependency retrieval sub-task are almost double
those of LITG, which has only a kind of
uni-lexical dependency in each language.
Although 20% is a low number, given the
fact that the dependencies are learned basically
through contrasting sentences in two languages,
the result is encouraging. The results slightly
improve over ITG with the right-head assumption
for English, which is based on linguistic insight.
Our results also echo the findings of Kuhn (2004),
who found that, guided by word alignments between
English and multiple other languages, a modified
EM training procedure for an English PCFG can
bootstrap a more accurate monolingual probabilistic
parser. Figure 4 shows an example English-side
dependency tree from the output of BLITG,
compared against the parser output.
We did not find that the feedback from the de-
Figure 4: Dependency tree extracted from parser output vs. Viterbi dependency tree from BLITG.
pendencies helps alignment. Understanding why
will require further analysis. One possibility is
that the dependencies are modeled but are not yet
reliable enough given the amount of training data.
Since EM training suffers from local maxima, we
may also need to adjust the training procedure to
obtain good parameters for the alignment task;
initializing the model with good dependency
parameters is one possible adjustment. We would
also like to point out that the alignment task is
simpler than decoding, where a stronger reordering
component is required to produce a fluent English
sentence. Investigating the impact of bilexical
dependencies on decoding is our future work.
Acknowledgments This work was supported
by NSF ITR IIS-09325646 and NSF ITR IIS-
0428020.
References
Alfred V. Aho and Jeffrey D. Ullman. 1972. The
Theory of Parsing, Translation, and Compiling, vol-
ume 1. Prentice-Hall, Englewood Cliffs, NJ.
Hiyan Alshawi, Srinivas Bangalore, and Shona Dou-
glas. 2000. Learning dependency translation mod-
els as collections of finite-state head transducers.
Computational Linguistics, 26(1):45–60.
Stanley F. Chen and Joshua Goodman. 1996. An em-
pirical study of smoothing techniques for language
modeling. In Proceedings of the 34th Annual Con-
ference of the Association for Computational Lin-
guistics (ACL-96), pages 310–318, Santa Cruz, CA.
ACL.
David Chiang. 2005. A hierarchical phrase-based
model for statistical machine translation. In Pro-
ceedings of the 43rd Annual Conference of the As-
sociation for Computational Linguistics (ACL-05),
pages 263–270, Ann Arbor, Michigan.
Jason Eisner and Giorgio Satta. 1999. Efficient pars-
ing for bilexical context-free grammars and head au-
tomaton grammars. In 37th Annual Meeting of the
Association for Computational Linguistics.
Jonas Kuhn. 2004. Experiments in parallel-text based
grammar induction. In Proceedings of the 42nd An-
nual Conference of the Association for Computa-
tional Linguistics (ACL-04).
Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-
linear models for word alignment. In Proceedings
of the 43rd Annual Conference of the Association
for Computational Linguistics (ACL-05), Ann Ar-
bor, Michigan.
I. Dan Melamed. 2003. Multitext grammars and syn-
chronous parsers. In Proceedings of the 2003 Meet-
ing of the North American chapter of the Associ-
ation for Computational Linguistics (NAACL-03),
Edmonton.
Franz Josef Och and Hermann Ney. 2000. Improved
statistical alignment models. In Proceedings of the
38th Annual Conference of the Association for Com-
putational Linguistics (ACL-00), pages 440–447,
Hong Kong, October.
Dekai Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora.
Computational Linguistics, 23(3):377–403.
Kenji Yamada and Kevin Knight. 2001. A syntax-
based statistical translation model. In Proceedings
of the 39th Annual Conference of the Association
for Computational Linguistics (ACL-01), Toulouse,
France.
Richard Zens and Hermann Ney. 2003. A compara-
tive study on reordering constraints in statistical ma-
chine translation. In Proceedings of the 40th Annual
Meeting of the Association for Computational Lin-
guistics, Sapporo, Japan.
Hao Zhang and Daniel Gildea. 2004. Syntax-based
alignment: Supervised or unsupervised? In Pro-
ceedings of the 20th International Conference on
Computational Linguistics (COLING-04), Geneva,
Switzerland, August.
Hao Zhang and Daniel Gildea. 2005. Stochastic lex-
icalized inversion transduction grammar for align-
ment. In Proceedings of the 43rd Annual Confer-
ence of the Association for Computational Linguis-
tics (ACL-05), Ann Arbor, MI.
