Proceedings of the Workshop on Statistical Machine Translation, pages 130–133,
New York City, June 2006. c©2006 Association for Computational Linguistics
Stochastic Inversion Transduction Grammars for Obtaining Word Phrases
for Phrase-based Statistical Machine Translation
J.A. Sánchez and J.M. Benedí
Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia
Valencia, Spain
jandreu@dsic.upv.esjbenedi@dsic.upv.es
Abstract
An important problem that is related to
phrase-based statistical translation mod-
els is the obtaining of word phrases from
an aligned bilingual training corpus. In
this work, we propose obtaining word
phrases by means of a Stochastic Inver-
sion Translation Grammar. Experiments
on the shared task proposed in this work-
shop with the Europarl corpus have been
carried out and good results have been ob-
tained.
1 Introduction
Phrase-based statistical translation systems are cur-
rently providing excellent results in real machine
translation tasks (Zens et al., 2002; Och and Ney,
2003; Koehn, 2004). In phrase-based statistical
translation systems, the basic translation units are
word phrases.
An important problem that is related to phrase-
based statistical translation is to automatically ob-
tain bilingual word phrases from parallel corpora.
Several methods have been defined for dealing with
this problem (Och and Ney, 2003). In this work, we
study a method for obtaining word phrases that is
based on Stochastic Inversion Transduction Gram-
mars that was proposed in (Wu, 1997).
Stochastic Inversion Transduction Grammars
(SITG) can be viewed as a restricted Stochas-
tic Context-Free Syntax-Directed Transduction
Scheme. SITGs can be used to carry out a simulta-
neous parsing of both the input string and the output
string. In this work, we apply this idea to obtain
aligned word phrases to be used in phrase-based
translation systems (Sánchez and Benedí, 2006).
In Section 2, we review the phrase-based machine
translation approach. SITGs are reviewed in Sec-
tion 3. In Section 4, we present experiments on the
shared task proposed in this workshop with the Eu-
roparl corpus.
2 Phrase-based Statistical Machine
Transduction
The translation units in a phrase-based statistical
translation system are bilingual phrases rather than
simple paired words. Several systems that fol-
low this approach have been presented in recent
works (Zens et al., 2002; Koehn, 2004). These sys-
tems have demonstrated excellent translation perfor-
mance in real tasks.
The basic idea of a phrase-based statistical ma-
chine translation system consists of the following
steps (Zens et al., 2002): first, the source sentence is
segmented into phrases; second, each source phrase
is translated into a target phrase; and third, the target
phrases are reordered in order to compose the target
sentence.
Bilingual translation phrases are an important
component of a phrase-based system. Different
methods have been defined to obtain bilingual trans-
lations phrases, mainly from word-based alignments
and from syntax-based models (Yamada and Knight,
2001).
In this work, we focus on learning bilingual word
phrases by using Stochastic Inversion Transduction
Grammars (SITGs) (Wu, 1997). This formalism al-
130
lows us to obtain bilingual word phrases in a natu-
ral way from the bilingual parsing of two sentences.
In addition, the SITGs allow us to easily incorpo-
rate many desirable characteristics to word phrases
such as length restrictions, selection according to the
word alignment probability, bracketing information,
etc. We review this formalism in the following sec-
tion.
3 Stochastic Inversion Transduction
Grammars
Stochastic Inversion Transduction Grammars
(SITGs) (Wu, 1997) can be viewed as a restricted
subset of Stochastic Syntax-Directed Transduction
Grammars. They can be used to simultaneously
parse two strings, both the source and the target
sentences. SITGs are closely related to Stochastic
Context-Free Grammars.
Formally, a SITG in Chomsky Normal Form1
a0a2a1 can be defined as a tuple
a3a5a4a7a6a9a8a10a6a9a11a13a12a2a6a9a11a15a14a16a6a18a17a19a6a5a20a22a21 ,
where: a4 is a finite set of non-terminal symbols;
a8a24a23a25a4 is the axiom of the SITG; a11a13a12 is a finite set
of terminal symbols of language 1; and a11 a14 is a finite
set of terminal symbols of language 2. a17 is a finite
set of: lexical rules of the type a26a28a27a30a29a22a31a16a32 , a26a33a27a30a32a34a31a36a35 ,
a26a37a27 a29a22a31a36a35 ; direct syntactic rules that are noted as
a26 a27 a38a39a41a40a43a42 ; and inverse syntactic rules that are
noted as a26a44a27a46a45a5a39a47a40a49a48 , where a26 a6 a39 a6 a40 a23a50a4 , a29 a23a15a11 a12 ,
a35
a23a33a11a51a14 , and
a32 is the empty string. When a direct
syntactic rule is used in a parsing, both strings are
parsed with the syntactic rule a26a52a27a53a39a47a40 . When an
inverse rule is used in a parsing, one string is parsed
with the syntactic rule a26 a27 a39a41a40 , and the other
string is parsed with the syntactic rule a26a37a27 a40a54a39 .
Terma20 of the tuple is a function that attaches a prob-
ability to each rule.
An efficient Viterbi-like parsing algorithm that is
based on a Dynamic Programing Scheme is pro-
posed in (Wu, 1997). The proposed algorithm has
a time complexity of a55 a3a18a56a29 a56a58a57a59a56a35 a56a57a59a56a17a60a56a58a21 . It is important
to note that this time complexity restricts the use of
the algorithm to real tasks with short strings.
If a bracketed corpus is available, then a modi-
fied version of the parsing algorithm can be defined
to take into account the bracketing of the strings.
1A Normal Form for SITGs can be defined (Wu, 1997) by
analogy to the Chomsky Normal Form for Stochastic Context-
Free Grammars.
The modifications are similar to those proposed in
(Pereira and Schabes, 1992) for the inside algorithm.
Following the notation that is presented in (Pereira
and Schabes, 1992), we can define a partially brack-
eted corpus as a set of sentence pairs that are an-
notated with parentheses that mark constituent fron-
tiers. More precisely, a bracketed corpus a61 is a set of
tuples a3 a29 a6 a39a63a62 a6 a35 a6 a39a63a64 a21 , where a29 and a35 are strings, a39a54a62
is the bracketing of a29 , and a39a54a64 is the bracketing of a35 .
Let a65 a62a66a64 be a parsing of a29 and a35 with the SITG a0 a1 . If
the SITG does not have useless symbols, then each
non-terminal that appears in each sentential form
of the derivation a65a67a62a66a64 generates a pair of substrings
a29a69a68a71a70a66a70a66a70a18a29a73a72 of a29 , a74a76a75a78a77a49a75a44a79a25a75
a56
a29
a56, and
a35a59a80a81a70a66a70a66a70a18a35a83a82 of a35 ,
a74a47a75a33a84a85a75a44a86a87a75
a56
a35
a56, and defines a span a3
a77
a6
a79
a21 of
a29 and
a span a3 a84 a6 a86 a21 of a35 . A derivation of a29 and a35 is com-
patible with a39 a62 and a39 a64 if all the spans defined by
it are compatible with a39a54a62 and a39a63a64 . This compatibil-
ity can be easily defined by the function a88 a3 a77 a6 a79 a6 a84 a6 a86 a21 ,
which takes a value of a74 if a3 a77 a6 a79 a21 does not overlap any
a89 a23
a39a90a62 and, if a3 a84
a6
a86
a21 does not overlap any a89 a23
a39a91a64 ;
otherwise it takes a value of a92 . This function filters
those derivations (or partial derivations) whose pars-
ing is not compatible with the bracketing defined in
the sample (Sánchez and Benedí, 2006).
The algorithm can be implemented to compute
only those subproblems in the Dynamic Program-
ing Scheme that are compatible with the bracket-
ing. Thus, the time complexity is a55 a3a18a56a29 a56a93a57a59a56a35 a56a57a59a56a17a60a56a58a21 for
an unbracketed string, while the time complexity is
a55
a3a18a56
a29
a56a94a56
a35
a56a94a56a17a60a56a58a21 for a fully bracketed string. It is impor-
tant to note that the last time complexity allows us to
work with real tasks with longer strings.
Moreover, the parse tree can be efficiently ob-
tained. Each node in the tree relates two word
phrases of the strings being parsed. The related word
phrases can be considered to be the translation of
each other. These word phrases can be used to com-
pute the translation table of a phrase-based machine
statistical translation system.
4 Experiments
The experiments in this section were carried out for
the shared task proposed in this workshop. This
consisted of building a probabilistic phrase transla-
tion table for phrase-based statistical machine trans-
lation. Evaluation was translation quality on an un-
seen test set. The experiments were carried out using
131
the Europarl corpus (Koehn, 2005). Table 1 shows
the language pairs and some figures of the training
corpora. The test set had a0 a6 a92a2a1a4a3 sentences.
Languages Sentences # words (input/output)
De-En 751,088 15,257,871 / 16,052,702
Es-En 730,740 15,725,136 / 15,222,505
Fr-En 688,031 15,599,184 / 13,808,505
Table 1: Figures of the training corpora. The lan-
guages are English (En), French (Fr), German (De)
and Spanish (Es)
A common framework was provided to all the par-
ticipants so that the results could be compared. The
material provided comprised of: a training set, a lan-
guage model, a baseline translation system (Koehn,
2004), and a word alignment. The participants could
augment these items by using: their own training
corpus, their own sentence alignment, their own lan-
guage model, or their own decoder. We only used
the provided material for the experiments reported
in this work. The BLEU score was used to measure
the results.
A SITG was obtained for every language pair in
this section as described below. The SITG was used
to parse paired sentences in the training sample by
using the parsing algorithm described in Section 3.
All pairs of word phrases that were derived from
each internal node in the parse tree, except the root
node, were considered for the phrase-based machine
translation system. A translation table was obtained
from paired word phrases by placing them in the ad-
equate order and counting the number of times that
each pair appeared in the phrases. These values were
then appropriately normalized (Sánchez and Benedí,
2006).
4.1 Obtaining a SITG from an aligned corpus
For this experiment, a SITG was constructed for ev-
ery language pair as follows. The alignment was
used to compose lexical rules of the form a26 a27
a5
a31a7a6 . The probability of each rule was obtained by
counting. Then, two additional rules of the form
a26 a27 a38a26a63a26 a42 and a26 a27 a45a5a26a63a26a43a48 were added. It is im-
portant to point out that the constructed SITG did
not parse all the training sentences. Therefore, the
model was smoothed by adding all the rules of the
form a26 a27 a5 a31a16a32 and a26 a27 a32a34a31a7a6 with low probabil-
ity, so that all the training sentences could be parsed.
The rules were then adequately normalized.
This SITG was used to obtain word phrases from
the training corpus. Then, these word phrases were
used by the Pharaoh system (Koehn, 2004) to trans-
late the test set. We used word phrases up to a given
length. In these experiments several lengths were
tested and the best values ranged from 6 to 10. Ta-
ble shows 2 the obtained results and the size of the
translation table.
Lang. BLEU Lang. BLEU
De-En 15.91 (8.7) En-De 11.20 (9.7)
Es-En 22.85 (6.5) En-Es 21.18 (8.6)
Fr-En 21.30 (7.3) En-Fr 20.12 (8.1)
Table 2: Obtained results for different pairs and di-
rections. The value in parentheses is the number of
word phrases in the translation table (in millions).
Note that better results were obtained when En-
glish was the target language.
4.2 Using bracketing information in the
parsing
As Section 3 describes, the parsing algorithm for
SITGs can be adequately modified in order to take
bracketed sentences into account. If the bracket-
ing respects linguistically motivated structures, then
aligned phrases with linguistic information can be
used. Note that this approach requires having qual-
ity parsed corpora available. This problem can be
reduced by using automatically learned parsers.
This experiment was carried out to determine the
performance of the translation when some kind of
structural information was incorporated in the pars-
ing algorithm described in Section 3. We bracketed
the English sentences of the Europarl corpus with
an automatically learned parser. This automatically
learned parser was trained with bracketed strings ob-
tained from the UPenn Treebank corpus. We then
obtained word phrases according to the bracketing
by using the same SITG that was described in the
previous section. The obtained phrases were used
with the Pharaoh system. Table 3 shows the results
obtained in this experiment.
Note that the results decreased slightly in all
132
Lang. BLEU Lang. BLEU
De-En 15.13 (7.1) En-De 10.40 (9.2)
Es-En 21.61 (6.6) En-Es 19.86 (9.6)
Fr-En 20.57 (6.3) En-Fr 18.95 (8.3)
Table 3: Obtained results for different pairs and di-
rections when word phrases were obtained from a
parsed corpus.The value in parentheses is the num-
ber of word phrases in the translation table (in mil-
lions).
cases. This may be due to the fact that the bracket-
ing incorporated hard restrictions to the paired word
phrases and some of them were too forced. In ad-
dition, many sentences could not be parsed (up to
5% on average) due to the bracketing. However, it
is important to point out that incorporating bracket-
ing information to the English sentences notably ac-
celerated the parsing algorithm, thereby accelerating
the process of obtaining word phrases, which is an
important detail given the magnitude of this corpus.
4.3 Combining word phrases
Finally, we considered the combination of both
kinds of segments. The results can be seen in Ta-
ble 4. This table shows that the results improved the
results of Table 2 when English was the target lan-
guage. However, the results did not improve when
English was the source language. The reason for this
could be that both kinds of segments were different
in nature, and, therefore, the number of word phrases
increased notably, specially in the English part.
Lang. BLEU Lang. BLEU
De-En 16.39 (17.1) En-De 11.02 (15.3)
Es-En 22.96 (11.7) En-Es 20.86 (14.1)
Fr-En 21.73 (17.0) En-Fr 19.93 (14.9)
Table 4: Obtained results for different pairs and di-
rections when word phrases were obtained from a
non-parsed corpus and a parsed corpus.The value in
parentheses is the number of word phrases in the
translation table (in millions).
5 Conclusions
In this work, we have explored the problem of
obtaining word phrases for phrase-based machine
translation systems from SITGs. We have described
how the parsing algorithms for this formalism can
be modified in order to take into account a brack-
eted corpus. If bracketed corpora are used the time
complexity can decrease notably and large tasks can
be considered. Experiments were reported for the
Europarl corpus, and the results obtained were com-
petitive.
For future work, we propose to work along dif-
ferent lines: first, to incorporate new linguistic in-
formation in both the parsing algorithm and in the
aligned corpus; second, to obtain better SITGs from
aligned bilingual corpora; an third, to improve the
SITG by estimating the syntactic rules. We also in-
tend to address other machine translation tasks.
Acknowledgements
This work has been partially supported by the Uni-
versidad Politécnica de Valencia with the ILETA
project.

References
P. Koehn. 2004. Pharaoh: a beam search decoder for
phrase-based statistical machine translation models.
In Proc. of AMTA.
P. Koehn. 2005. Europarl: A parallel corpus for statisti-
cal machine translation. In Proc. of MT Summit.
F.J. Och and H. Ney. 2003. A systematic comparison of
various statistical alignment models. Computational
Linguistics, 29(1):19–52.
F. Pereira and Y. Schabes. 1992. Inside-outside reesti-
mation from partially bracketed corpora. In Proceed-
ings of the 30th Annual Meeting of the Association for
Computational Linguistics, pages 128–135. University
of Delaware.
J.A. Sánchez and J.M. Benedí. 2006. Obtaining word
phrases with stochastic inversion transduction gram-
mars for phrase-based statistical machine translation.
In Proc. 11th Annual conference of the European Asso-
ciation for Machine Translation, page Accepted, Oslo,
Norway.
D. Wu. 1997. Stochastic inversion transduction gram-
mars and bilingual parsing of parallel corpora. Com-
putational Linguistics, 23(3):377–404.
K. Yamada and K. Knight. 2001. A syntax-based sta-
tistical translation model. In Proc. of the 39th Annual
Meeting of the Association of Computational Linguis-
tics, pages 523–530.
R. Zens, F.J. Och, and H. Ney. 2002. Phrase-based statis-
tical machine translation. In Proc. of the 25th Annual
German Conference on Artificial Intelligence, pages
18–32.
