Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 777–784,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Left-to-Right Target Generation for Hierarchical Phrase-based
Translation
Taro Watanabe Hajime Tsukada Hideki Isozaki
2-4, Hikaridai, Seika-cho, Soraku-gun,
Kyoto, JAPAN 619-0237
{taro,tsukada,isozaki}@cslab.kecl.ntt.co.jp
Abstract
We present a hierarchical phrase-based
statistical machine translation model in
which a target sentence is efficiently
generated in left-to-right order. The
model is a class of synchronous-CFG with
a Greibach Normal Form-like structure for
the projected production rule: the paired
target side of a production rule takes a
phrase-prefixed form. The decoder for the
target-normalized form is based on an
Earley-style top-down parser on the source
side. The target-normalized form coupled
with our top-down parser implies a
left-to-right generation of translations,
which enables straightforward integration
with n-gram language models. We evaluated
our model on a Japanese-to-English
newswire translation task, and it showed
statistically significant performance
improvements over a phrase-based
translation system.
1 Introduction
In classical statistical machine translation, a foreign
language sentence f_1^J = f_1, f_2, ..., f_J is translated
into another language, i.e. English, e_1^I = e_1, e_2, ..., e_I,
by seeking the maximally likely solution of:

\hat{e}_1^I = \operatorname{argmax}_{e_1^I} \Pr(e_1^I | f_1^J)   (1)
            = \operatorname{argmax}_{e_1^I} \Pr(f_1^J | e_1^I) \Pr(e_1^I)   (2)
The source channel approach in Equation 2 decomposes
translation knowledge into two independent models:
a translation model and a language model
(Brown et al., 1993). The former represents the
correspondence between the two languages, and the
latter contributes to the fluency of the English output.
In state-of-the-art statistical machine translation,
the posterior probability Pr(e_1^I | f_1^J) is directly
maximized using a log-linear combination of feature
functions (Och and Ney, 2002):

\hat{e}_1^I = \operatorname{argmax}_{e_1^I}
  \frac{\exp\left( \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right)}
       {\sum_{e'^{I'}_1} \exp\left( \sum_{m=1}^{M} \lambda_m h_m(e'^{I'}_1, f_1^J) \right)}   (3)
where h_m(e_1^I, f_1^J) is a feature function, such as
an n-gram language model or a translation model.
When decoding, the denominator is dropped since
it depends only on f_1^J. The feature function scaling
factors \lambda_m are optimized by a maximum
likelihood approach (Och and Ney, 2002) or by direct
error minimization (Och, 2003). This modeling allows
the integration of various feature functions depending
on how a translation is to be constituted.
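As a minimal illustration of Equation 3, the following sketch scores candidate translations under a log-linear model; the function names, weights, and feature values are hypothetical, and the normalizer is dropped during search since it is constant in e:

```python
import math

def loglinear_score(weights, features):
    """Unnormalized log-linear score: exp(sum_m lambda_m * h_m(e, f)).

    `weights` and `features` are parallel sequences holding the scaling
    factors lambda_m and the feature values h_m(e, f) for one candidate.
    """
    return math.exp(sum(w * h for w, h in zip(weights, features)))

def best_candidate(weights, candidates):
    """argmax over candidates; the denominator of Equation 3 is the
    same for every e and can therefore be ignored when decoding."""
    return max(candidates, key=lambda feats: loglinear_score(weights, feats))

# Two hypothetical candidates with (LM log-prob, TM log-prob) features:
cands = [(-2.0, -1.5), (-1.0, -3.0)]
best = best_candidate([0.7, 0.3], cands)  # → (-1.0, -3.0)
```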
A phrase-based translation model is one of the
modern approaches which exploits a phrase, a
contiguous sequence of words, as a unit of transla-
tion (Koehn et al., 2003; Zens and Ney, 2003; Till-
man, 2004). The idea is based on a word-based
source channel modeling of Brown et al. (1993):
It assumes that eI1 is segmented into a sequence
of K phrases ¯eK1 . Each phrase ¯ek is transformed
into ¯fk. The translated phrases are reordered to
form f J1 . One of the benefits of the modeling is
that the phrase translation unit preserves localized
word reordering. However, it cannot hypothesize
a long-distance reordering required for linguisti-
cally divergent language pairs. For instance, when
translating Japanese to English, a Japanese SOV
structure has to be reordered to match an
English SVO structure. Such sentence-wise
movement cannot be realized within the phrase-based
modeling.
Chiang (2005) introduced a hierarchical phrase-
based translation model that combined the
strength of the phrase-based approach and a
synchronous-CFG formalism (Aho and Ullman,
1969): A rewrite system initiated from a start
symbol which synchronously rewrites paired non-
terminals. Their translation model is a binarized
synchronous-CFG, or a rank-2 synchronous-CFG,
in which the right-hand side of a production
rule contains at most two non-terminals. The form
can be regarded as a phrase translation pair with
at most two holes instantiated with other phrases.
The hierarchically combined phrases provide a
sort of reordering constraint that is not directly
modeled by a phrase-based model.
Rules are induced from a bilingual corpus with-
out linguistic clues first by extracting phrase trans-
lation pairs, and then by generalizing extracted
phrases with holes (Chiang, 2005). Even in a
phrase-based model, the number of phrases
extracted from a bilingual corpus is quadratic
in the length of the bilingual sentences. The
grammar size for the hierarchical phrase-based
model explodes further, since there are numerous
combinations of holes that can be inserted into
each rule. The spuriously increased grammar size
is problematic for decoding without certain
heuristics, such as length-based thresholding.
Integration with an n-gram language model
further increases the cost of decoding, especially
when incorporating a higher-order n-gram, such as
a 5-gram model. In the hierarchical phrase-based
model (Chiang, 2005) and the inversion transduction
grammar (ITG) (Wu, 1997), the problem is resolved
by restricting the grammar to a binarized form in
which at most two non-terminals are allowed in the
right-hand side. However, Huang et al. (2005)
reported that the computational complexity of
decoding with an n-gram language model amounted to
O(J^{3+3(n-1)}) even when using the hook technique.
The complexity lies in memorizing the n-gram context
for each constituent; the order of the n-gram thus
becomes a dominant factor for higher-order models.
As an alternative to a binarized form, we
present a target-normalized hierarchical phrase-based
translation model. The model is a class of
hierarchical phrase-based model, but is constrained
so that the English part of the right-hand side
is restricted to a Greibach Normal Form (GNF)-like
structure: a contiguous sequence of terminals, or a
phrase, is followed by a string of non-terminals.
The target-normalized form reduces the number of
rules extracted from a bilingual corpus, but still
preserves the strength of the phrase-based approach.
Integration with an n-gram language model is
straightforward, since the model generates a
translation in left-to-right order. Our decoder is
based on Earley-style top-down parsing on the
foreign language side. The projected English side
is generated in left-to-right order, synchronized
with the derivation of the foreign language side.
The decoder's implementation is modeled after a
decoder for an existing phrase-based model, with a
simple modification to account for production rules.
Experimental results on a Japanese-to-English
newswire translation task showed significant
improvement over a phrase-based model.
2 Translation Model
A weighted synchronous-CFG is a rewrite system
consisting of production rules whose right-hand
side is paired (Aho and Ullman, 1969):
X → 〈γ, α, ∼〉   (4)
where X is a non-terminal, γ and α are strings of
terminals and non-terminals. For notational sim-
plicity, we assume that γ and α correspond to the
foreign language side and the English side, re-
spectively. ∼ is a one-to-one correspondence for
the non-terminals appearing in γ and α. Starting
from an initial non-terminal, each rule rewrites
non-terminals in γ and α that are associated with
∼.
Chiang (2005) proposed a hierarchical phrase-
based translation model, a binary synchronous-
CFG, which restricted the form of production rules
as follows:
• Only two types of non-terminals are allowed:
S and X.
• Both strings γ and α must contain at least
one terminal item.
• Rules may have at most two non-terminals,
and non-terminals cannot be adjacent on the
foreign language side γ.
The production rules are induced from a bilingual
corpus with the help of word alignments. To al-
leviate a data sparseness problem, glue rules are
added that prefer combining hierarchical phrases
in a serial manner:
S → 〈S_1 X_2, S_1 X_2〉   (5)
S → 〈X_1, X_1〉   (6)

where the indices indicate the non-terminal linkages
represented in ∼.
Our model is based on Chiang (2005)’s frame-
work, but further restricts the form of production
rules so that the aligned right-hand side α follows
a GNF-like structure:
X → 〈γ, b̄ β, ∼〉   (7)

where b̄ is a string of terminals, or a phrase,
and β is a (possibly empty) string of non-terminals.
The foreign language side γ of the right-hand side
still takes an arbitrary string of terminals and
non-terminals. The use of a phrase b̄ as a prefix
keeps the strength of the phrase-based framework.
A contiguous English side coupled with a (possibly)
discontiguous foreign language side preserves
phrase-bounded local word reordering. At the same
time, the target-normalized framework still combines
phrases hierarchically in a restricted manner.
The target-normalized form can be regarded as
a type of rule in which certain non-terminals are
always instantiated with phrase translation pairs.
Thus, we will be able to reduce the number of rules
induced from a bilingual corpus, which, in turn,
helps reduce the decoding complexity.
The contiguous phrase-prefixed form generates
English in left-to-right order. Therefore, a decoder
can easily hypothesize a derivation tree integrated
with an n-gram language model, even a higher-order
one.
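Because every new target word is emitted to the right of a fully generated prefix, its complete n-gram context is always available. The following sketch (with a hypothetical `score_extension` function, and an assumed toy LM representation as a dict from context tuples to word log-probabilities) illustrates this incremental scoring:

```python
import math

def score_extension(lm, history, phrase, order=3):
    """Score appending `phrase` to the partial translation `history`
    under an n-gram LM given as {context_tuple: {word: log_prob}}.
    Left-to-right generation guarantees that the full (order-1)-word
    context of every new word already exists in the prefix."""
    logprob = 0.0
    hist = list(history)
    for word in phrase:
        context = tuple(hist[-(order - 1):])
        # Floor for unseen events; a real LM would back off instead.
        logprob += lm.get(context, {}).get(word, math.log(1e-6))
        hist.append(word)
    return logprob, hist
```

With a binarized grammar, by contrast, a constituent's left and right n-gram contexts must both be memorized until it is combined with its neighbors, which is the source of the complexity discussed above.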
Note that we do not claim that arbitrary
synchronous-CFGs can be transformed into the
target-normalized form. The form simply restricts
the grammar extracted from a bilingual corpus,
as explained in the next section.
2.1 Rule Extraction
We present an algorithm to extract production
rules from a bilingual corpus. The procedure is
based on those for the hierarchical phrase-based
translation model (Chiang, 2005).
First, a bilingual corpus is annotated with word
alignments using the method of Koehn et al.
(2003). Many-to-many word alignments are in-
duced by running a one-to-many word alignment
model, such as GIZA++ (Och and Ney, 2003), in
both directions and by combining the results based
on a heuristic (Koehn et al., 2003).
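The combination step can be sketched as follows; this is a simplified, hypothetical variant of the grow-style heuristic of Koehn et al. (2003), starting from the intersection of the two directional alignments and adding union links adjacent to an existing link:

```python
def symmetrize(src2trg, trg2src):
    """Combine two one-to-many word alignments, each given as a set of
    (j, i) links (source position j, target position i): seed with the
    intersection, then repeatedly add union links that neighbor an
    already-accepted link."""
    union = src2trg | trg2src
    alignment = set(src2trg & trg2src)
    added = True
    while added:
        added = False
        for (j, i) in sorted(union - alignment):
            if any((j + dj, i + di) in alignment
                   for dj in (-1, 0, 1) for di in (-1, 0, 1)):
                alignment.add((j, i))
                added = True
    return alignment
```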
Second, phrase translation pairs are extracted
from the word alignment corpus (Koehn et al.,
2003). The method exhaustively extracts phrase
pairs ( f j+mj , ei+ni ) from a sentence pair ( f J1 , eI1) that
do not violate the word alignment constraints a:
∃(i′, j′) ∈ a : j′ ∈ [j, j + m], i′ ∈ [i, i + n]
∄(i′, j′) ∈ a : j′ ∈ [j, j + m], i′ ∉ [i, i + n]
∄(i′, j′) ∈ a : j′ ∉ [j, j + m], i′ ∈ [i, i + n]
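The three constraints amount to requiring at least one link inside the candidate pair and no link crossing its boundary. A direct check, with a hypothetical `is_consistent` helper:

```python
def is_consistent(a, j, m, i, n):
    """Check the three alignment constraints for a candidate phrase
    pair (f_j..f_{j+m}, e_i..e_{i+n}); `a` is a set of (j', i')
    alignment links."""
    inside = [(jp, ip) for (jp, ip) in a
              if j <= jp <= j + m and i <= ip <= i + n]
    if not inside:                       # first constraint: some link inside
        return False
    for (jp, ip) in a:
        if j <= jp <= j + m and not (i <= ip <= i + n):
            return False                 # source word aligned outside target span
        if i <= ip <= i + n and not (j <= jp <= j + m):
            return False                 # target word aligned outside source span
    return True
```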
Third, based on the extracted phrases, production
rules are accumulated by computing the “holes”
for contiguous phrases (Chiang, 2005):
1. A phrase pair (f̄, ē) constitutes a rule
X → 〈f̄, ē〉
2. A rule X → 〈γ, α〉 and a phrase pair (f̄, ē) s.t.
γ = γ′ f̄ γ′′ and α = ē′ ē β constitutes a rule
X → 〈γ′ X_k γ′′, ē′ X_k β〉
Following Chiang (2005), we applied constraints
when inducing rules with non-terminals:
• At least one foreign word must be aligned to
an English word.
• Adjacent non-terminals are not allowed for
the foreign language side.
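Step 2 can be sketched on a string representation of rules; `punch_hole` is a hypothetical illustration, and enforcing the bulleted constraints and the GNF-like target shape (the hole must fall inside the non-terminal suffix β) is assumed to be done by the caller:

```python
def punch_hole(rule_src, rule_trg, sub_src, sub_trg, index):
    """Replace one occurrence of an embedded phrase pair
    (sub_src, sub_trg) inside a rule with a linked non-terminal X_k,
    written here as 'X1', 'X2', ... Returns None if the sub-phrase
    does not occur on either side."""
    nt = "X%d" % index
    s = rule_src.find(sub_src)
    t = rule_trg.find(sub_trg)
    if s < 0 or t < 0:
        return None
    new_src = rule_src[:s] + nt + rule_src[s + len(sub_src):]
    new_trg = rule_trg[:t] + nt + rule_trg[t + len(sub_trg):]
    return new_src, new_trg

# e.g. generalizing a hypothetical Japanese-English phrase pair:
# punch_hole("watashi wa ringo o taberu", "i eat an apple",
#            "ringo", "an apple", 1)
# → ("watashi wa X1 o taberu", "i eat X1")
```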
2.2 Phrase-based Rules
The rule extraction procedure described in Section
2.1 is corpus-based and therefore easily suffers
from a data sparseness problem. The hierarchical
phrase-based model avoided this problem by
introducing the glue rules (5) and (6), which
combine hierarchical phrases sequentially
(Chiang, 2005).
We use a different method of generalizing pro-
duction rules. When production rules without non-
terminals are extracted in step 1 of Section 2.1,
X → 〈f̄, ē〉   (8)

then, we also add production rules as follows:

X → 〈f̄ X_1, ē X_1〉   (9)
X → 〈X_1 f̄, ē X_1〉   (10)
X → 〈X_1 f̄ X_2, ē X_1 X_2〉   (11)
X → 〈X_2 f̄ X_1, ē X_1 X_2〉   (12)
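On a string representation of rules, this generalization is mechanical; `generalize` below is a hypothetical sketch that emits rules (9)-(12) from a phrase-pair rule (8), writing the linked indices as X1 and X2:

```python
def generalize(f, e):
    """Given a phrase-pair rule X -> <f, e> (rule 8), produce the four
    generalized rules (9)-(12): the target side always keeps the phrase
    e as a prefix, followed only by non-terminals, so every generated
    rule stays in the target-normalized (GNF-like) form."""
    return [
        (f + " X1",         e + " X1"),      # (9)
        ("X1 " + f,         e + " X1"),      # (10)
        ("X1 " + f + " X2", e + " X1 X2"),   # (11)
        ("X2 " + f + " X1", e + " X1 X2"),   # (12)
    ]
```

Note that rules (10) and (12) reorder material across the phrase, which is how long-distance movement (such as Japanese SOV to English SVO) can be hypothesized while the target side is still emitted left to right.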