Approximating Context-Free by Rational Transduction for
Example-Based MT
Mark-Jan Nederhof*
AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932
and
Alfa Informatica (RUG), P.O. Box 716, NL-9700 AS Groningen, The Netherlands
*The second address is the current contact address; supported by the Royal Netherlands Academy of Arts and Sciences; current secondary affiliation is the German Research Center for Artificial Intelligence (DFKI).
Abstract
Existing studies show that a weighted context-free transduction of reasonable quality can be effectively learned from examples. This paper investigates the approximation of such transductions by means of weighted rational transductions. The advantage is increased processing speed, which benefits real-time applications involving spoken language.
1 Introduction
Several studies have investigated automatic or
partly automatic learning of transductions for ma-
chine translation. Some of these studies have con-
centrated on finite-state or extended finite-state
machinery, such as (Vilar and others, 1999), oth-
ers have chosen models closer to context-free
grammars and context-free transduction, such as
(Alshawi et al., 2000; Watanabe et al., 2000; Ya-
mamoto and Matsumoto, 2000), and yet other
studies cannot be comfortably assigned to either
of these two frameworks, such as (Brown and oth-
ers, 1990) and (Tillmann and Ney, 2000).
In this paper we will investigate both context-
free and finite-state models. The basis for our
study is context-free transduction since that is a
powerful model of translation, which can in many
cases adequately describe the changes of word
order between two languages, and the selection
of appropriate lexical items. Furthermore, for
limited domains, automatic learning of weighted
context-free transductions from examples seems
to be reasonably successful.
However, practical algorithms for computing the most likely context-free derivation have cubic time complexity in the length of the input string or, in the case of a graph output by a speech recognizer, in the number of nodes in the graph. For certain lexicalized context-free models we even obtain higher time complexities when the size of the grammar is not considered to be a parameter (Eisner and Satta, 1999). This may pose problems, especially for real-time speech systems.
Therefore, we have investigated approximation
of weighted context-free transduction by means
of weighted rational transduction. The finite-state
machinery for implementing the latter kind of
transduction in general allows faster processing.
We can also more easily obtain robustness. We
hope the approximating model is able to preserve
some of the accuracy of the context-free model.
In the next section, we discuss preliminary def-
initions, adapted from existing literature, mak-
ing no more than small changes in presentation.
In Section 3 we explain how context-free trans-
duction grammars can be represented by ordinary
context-free grammars, plus a phase of postpro-
cessing. The approximation is discussed in Sec-
tion 4. As shown in Section 5, we may easily
process input in a robust way, ensuring we always
obtain output. Section 6 discusses empirical re-
sults, and we end the paper with conclusions.
2 Preliminaries
2.1 hierarchical alignment
The input to our algorithm is a corpus consisting of pairs of sentences related by a hierarchical alignment (Alshawi et al., 2000). In what follows, the formalization of this concept has been slightly changed with respect to the above reference, to suit our purposes in the remainder of this article.
The hierarchically aligned sentence pairs in the corpus are 5-tuples $(w_1, w_2, f_1, f_2, \mathcal{A})$ satisfying the following. The first two components, $w_1$ and $w_2$, are strings, called the source string and the target string, respectively, the lengths of which are denoted by $n_1 = |w_1|$ and $n_2 = |w_2|$. We let $P_1$ and $P_2$ denote the sets of string positions $\{1, \ldots, n_1\}$ and $\{1, \ldots, n_2\}$, respectively.

Further, $f_1$ (resp. $f_2$) is a mapping from positions in $P_1 \cup \{0\}$ (resp. $P_2 \cup \{0\}$) to pairs of lists of positions from $P_1$ (resp. $P_2$), satisfying the following: if a position $p$ is mapped to a pair $(l_1, l_2)$, then the positions in the list $l_1 \cdot [p] \cdot l_2$ are in strictly increasing order; we let "$\cdot$" denote list concatenation, and $[p]$ represents a list consisting of the single element $p$.

Each position in $P_1$ (resp. $P_2$) should occur at most once in the image of $f_1$ (resp. $f_2$). This means that $f_1$ and $f_2$ assign dependency structures to the source and target strings.
A further restriction on $f_1$ and $f_2$ requires some auxiliary definitions. Let $f$ be either $f_1$ or $f_2$. We define $\bar{f}$ as the function that maps each position $p$ to the list of positions $\bar{f}(i_1) \cdot \ldots \cdot \bar{f}(i_{k_1}) \cdot [p] \cdot \bar{f}(j_1) \cdot \ldots \cdot \bar{f}(j_{k_2})$ when $f(p) = ([i_1, \ldots, i_{k_1}], [j_1, \ldots, j_{k_2}])$. If $w$ is a string $a_1 \cdots a_n$, and $l$ is a list $[i_1, \ldots, i_k]$ of string positions in $w$, then $w/l$ represents the string $a_{i_1} \cdots a_{i_k}$. If $p$ is a single position, then $w/p$ represents the symbol $a_p$.
We now say that $f$ is projective if $\bar{f}$ maps each position $p$ to some interval of positions $[m, m+1, \ldots, m'-1, m']$. We will assume that both $f_1$ and $f_2$ are projective. (Strictly speaking, our algorithm would still be applicable if they were not projective, but it would treat the hierarchical alignment as if the symbols in the source and target strings had been reordered to make $f_1$ and $f_2$ projective.) Furthermore, a reasonable hierarchical alignment satisfies $\bar{f}(0) = [0, 1, \ldots, n]$, where $n = n_1$ or $n = n_2$ when $f = f_1$ or $f = f_2$, respectively, which means that all symbols in the string are indirectly linked to the 'dummy' position 0.
Lastly, $\mathcal{A}$ is the union of $\{(0, 0)\}$ and a subset of $P_1 \times P_2$ that relates positions in the two strings. It is such that $(p_1, j), (p_2, j) \in \mathcal{A}$ imply $p_1 = p_2$, and $(p, j_1), (p, j_2) \in \mathcal{A}$ imply $j_1 = j_2$; in other words, a position in one string is related to at most one position in the other. Furthermore, for each $(p, j) \in \mathcal{A} - \{(0, 0)\}$ there is a pair $(p', j') \in \mathcal{A}$ such that $p$ occurs in one of the two lists of $f_1(p')$ and $j$ occurs in one of the two lists of $f_2(j')$; this means that positions can only be related if their respective "mother" positions are related.
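To make these definitions concrete, the following Python sketch (all names are illustrative; the paper does not describe an implementation) encodes the dependency mapping of one string and computes the projection $\bar{f}$, from which projectivity can be checked directly.

# A minimal sketch of the dependency mapping f and its projection.
# f maps a position to a pair (left, right) of lists of dependents;
# positions without dependents may be absent from the dict.

def bar_f(f, p):
    """List all direct and indirect dependents of p, in string order."""
    left, right = f.get(p, ([], []))
    result = []
    for i in left:
        result.extend(bar_f(f, i))
    result.append(p)
    for j in right:
        result.extend(bar_f(f, j))
    return result

def is_projective(f, p=0):
    """f is projective if bar_f maps every position to an interval."""
    l = bar_f(f, p)
    if sorted(l) != list(range(min(l), max(l) + 1)):
        return False
    left, right = f.get(p, ([], []))
    return all(is_projective(f, q) for q in left + right)

# Source string "I like him"; position 0 is the dummy root.
f1 = {0: ([], [2]), 2: ([1], [3])}   # "like" heads "I" and "him"
assert bar_f(f1, 0) == [0, 1, 2, 3] and is_projective(f1)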
Note that this paper does not discuss how hi-
erarchical alignments can be obtained from unan-
notated corpora of bitexts. This is the subject of
existing studies, such as (Alshawi et al., 2000).
2.2 context-free transduction
Context-free transduction was originally called
syntax-directed transduction in (Lewis II and
Stearns, 1968), but since in modern formal lan-
guage theory and computational linguistics the
term “syntax” has a much wider range of mean-
ings than just “context-free syntax”, we will not
use the original term here.
A (context-free) transduction grammar is a 5-tuple $(N, \Sigma_1, \Sigma_2, R, S)$, where $N$ is a finite set of nonterminals, $S \in N$ is the start symbol, $\Sigma_1$ and $\Sigma_2$ are the source and target alphabets, and $R$ is a finite set of productions of the form $A \to (\alpha, \beta)$, where $A \in N$, $\alpha \in (N \cup \Sigma_1)^*$ and $\beta \in (N \cup \Sigma_2)^*$, such that each nonterminal in $\alpha$ occurs exactly once in $\beta$ and each nonterminal in $\beta$ occurs exactly once in $\alpha$.1

If we were to replace each RHS pair by only its first half $\alpha$, we would obtain a context-free grammar for the source language, and if we were to replace each RHS pair by its second half $\beta$, we would obtain a context-free grammar for the target language. The combination of the two halves of such an RHS indicates how a parse for the source language can be related to a parse for the target language, and this defines a transduction between the languages in an obvious way.

1Note that we ignore the case that a single nonterminal occurs twice or more in $\alpha$ or $\beta$; if we were to include this case, some tedious complications of notation would result, without any theoretical gain such as an increase of generative power. We refer to (Lewis II and Stearns, 1968) for the general case.
An example of a transduction grammar is:

S → (Subj-IObj "like" Obj-Subj,  Obj-Subj Subj-IObj "plaît")
Subj-IObj → ("I", "me")
Obj-Subj → ("him", "il")

This transduction defines that the sentence "I like him" can be translated as "il me plaît".
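To illustrate how the two halves of the productions are expanded in lockstep, here is a small Python sketch of this toy grammar; it relies on each nonterminal occurring exactly once in both halves (cf. footnote 1) and is not meant as a general transducer.

# Toy synchronous derivation: each rule maps a nonterminal to a
# (source RHS, target RHS) pair; linked nonterminals are expanded
# simultaneously on both sides.

RULES = {
    "S":         (["Subj-IObj", "like", "Obj-Subj"],
                  ["Obj-Subj", "Subj-IObj", "plaît"]),
    "Subj-IObj": (["I"], ["me"]),
    "Obj-Subj":  (["him"], ["il"]),
}

def derive(symbol):
    """Expand a nonterminal into a (source, target) pair of token lists."""
    if symbol not in RULES:               # terminal
        return [symbol], [symbol]
    src_rhs, tgt_rhs = RULES[symbol]
    sub = {s: derive(s) for s in src_rhs if s in RULES}
    src = [t for s in src_rhs for t in (sub[s][0] if s in RULES else [s])]
    tgt = [t for s in tgt_rhs for t in (sub[s][1] if s in RULES else [s])]
    return src, tgt

src, tgt = derive("S")
print(" ".join(src), "->", " ".join(tgt))   # I like him -> il me plaît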
We can reduce the generative power of context-free transduction grammars by a syntactic restriction that corresponds to the bilexical context-free grammars of (Eisner and Satta, 1999). Let us define a bilexical transduction grammar as a transduction grammar such that:

• there is a mapping from the set of nonterminals to $\Sigma_1 \times \Sigma_2$. Due to this property, we may write each nonterminal as $A[a, b]$ to indicate that it is mapped to the pair $(a, b)$, where $a \in \Sigma_1$ and $b \in \Sigma_2$, and where $A$ is a so-called delexicalized nonterminal. We may write $S$ as $A[\$, \$]$, where $\$$ is a dummy symbol at the dummy string position 0. Further,

• each production is of one of the following five forms:

$A[a, b] \to (B[a, b]\ C[c, d],\ B[a, b]\ C[c, d])$
$A[a, b] \to (B[a, b]\ C[c, d],\ C[c, d]\ B[a, b])$
$A[a, b] \to (C[c, d]\ B[a, b],\ B[a, b]\ C[c, d])$
$A[a, b] \to (C[c, d]\ B[a, b],\ C[c, d]\ B[a, b])$
$A[a, b] \to (a, b)$
For convenience, we also allow productions of the form:

$A[a, b] \to (x_1\ B[a, b]\ y_1,\ x_2\ B[a, b]\ y_2)$

where $x_1, y_1 \in \Sigma_1^*$ and $x_2, y_2 \in \Sigma_2^*$.
In the experiments in Section 6, we also consider nonterminals that are lexicalized only by the source alphabet, which means that these nonterminals can be written as $A[a]$, where $a \in \Sigma_1$. The motivation is to restrict the grammar size and to increase the coverage.
Bilexical transduction grammars are equivalent
to the dependency transduction model from (Al-
shawi et al., 2000).
2.3 obtaining a context-free transduction
from the corpus
We extract a context-free transduction grammar
from a corpus of hierarchical alignments, by lo-
cally translating each hierarchical alignment into
a set of productions. The union of all these sets for
the whole corpus is then the transduction gram-
mar. Counting the number of times that identi-
cal productions are generated allows us to assign
probabilities to the productions by maximum like-
lihood estimation.
We will consider a method that uses only one delexicalized nonterminal $A$. For a pair $(p, p') \in \mathcal{A}$, we have a nonterminal $A[w_1/p, w_2/p']$ or a nonterminal $A[w_1/p]$, depending on whether nonterminals are lexicalized by both the source and target alphabets, or by just the source alphabet. Let us call that nonterminal $\mathit{nont}(p, p')$.
Each pair of positions $(p, p') \in \mathcal{A}$ gives rise to one production. Suppose that

$f_1(p) = ([i_1, \ldots, i_{k_1}], [j_1, \ldots, j_{k_2}])$

and each position in this pair is related by $\mathcal{A}$ to some position from $P_2$, which we will call $i'_1, \ldots, i'_{k_1}, j'_1, \ldots, j'_{k_2}$, respectively, and, similarly, suppose that

$f_2(p') = ([i''_1, \ldots, i''_{k_3}], [j''_1, \ldots, j''_{k_4}])$

and each position in this pair is related by $\mathcal{A}$ to some position from $P_1$, which we will call $i'''_1, \ldots, i'''_{k_3}, j'''_1, \ldots, j'''_{k_4}$. Then the production is given by

$\mathit{nont}(p, p') \to (\,\mathit{nont}(i_1, i'_1) \cdots \mathit{nont}(i_{k_1}, i'_{k_1}) \;\; w_1/p \;\; \mathit{nont}(j_1, j'_1) \cdots \mathit{nont}(j_{k_2}, j'_{k_2}),$
$\qquad \mathit{nont}(i'''_1, i''_1) \cdots \mathit{nont}(i'''_{k_3}, i''_{k_3}) \;\; w_2/p' \;\; \mathit{nont}(j'''_1, j''_1) \cdots \mathit{nont}(j'''_{k_4}, j''_{k_4})\,)$
Note that both halves of the RHS contain the same nonterminals, but possibly in a different order. However, if any position in $f_1(p)$ or $f_2(p')$ is not related to some other position by $\mathcal{A}$, then the production above contains, instead of a nonterminal, the substring onto which that position is projected by $\bar{f}_1$ or $\bar{f}_2$, respectively. E.g. if there is no position $i'_1$ such that $(i_1, i'_1) \in \mathcal{A}$, then instead of $\mathit{nont}(i_1, i'_1)$ we have the string $w_1/\bar{f}_1(i_1)$.
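The extraction step can be sketched in Python as follows, for the fully aligned case only; helper names are hypothetical, and unaligned dependents (which contribute plain substrings, as just described) are omitted for brevity.

# Sketch: extract the production for one aligned pair (p, p').
# t maps a source position to its aligned target position, t_inv is
# the inverse; f1, f2 are the dependency mappings as sketched above.

def nont(w1, w2, p, p_, bilexical=True):
    if bilexical:
        return f"A[{w1[p - 1]},{w2[p_ - 1]}]"   # lexicalized on both sides
    return f"A[{w1[p - 1]}]"                    # source lexicalization only

def production(w1, w2, f1, f2, t, t_inv, p, p_):
    li, lj = f1.get(p, ([], []))        # source dependents of p
    ti, tj = f2.get(p_, ([], []))       # target dependents of p'
    src = ([nont(w1, w2, i, t[i]) for i in li] + [w1[p - 1]]
           + [nont(w1, w2, j, t[j]) for j in lj])
    tgt = ([nont(w1, w2, t_inv[i], i) for i in ti] + [w2[p_ - 1]]
           + [nont(w1, w2, t_inv[j], j) for j in tj])
    return nont(w1, w2, p, p_), (src, tgt)

w1, w2 = ["I", "like", "him"], ["il", "me", "plaît"]
f1, f2 = {2: ([1], [3])}, {3: ([1, 2], [])}
t = {1: 2, 2: 3, 3: 1}                  # I-me, like-plaît, him-il
t_inv = {v: k for k, v in t.items()}
lhs, rhs = production(w1, w2, f1, f2, t, t_inv, 2, 3)
# lhs == "A[like,plaît]"
# rhs == (["A[I,me]", "like", "A[him,il]"],
#         ["A[him,il]", "A[I,me]", "plaît"])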
In general, we cannot adapt the above algorithm to produce transduction grammars that are bilexical. For example, a production of the form

$A[a, a'] \to (A[b, b']\ A[c, c']\ a,\ A[c, c']\ A[b, b']\ a')$

cannot be broken up into smaller, bilexical productions.2 However, the hierarchical alignments that we work with were produced by an algorithm that ensures that bilexical grammars suffice. Formally, this applies when the following cannot occur: there are $p, p_1, p_2 \in P_1$ and $j, j_1, j_2 \in P_2$ such that $(p, j) \in \mathcal{A}$, $p_1$ and $p_2$ occur in $f_1(p)$, $j_1$ and $j_2$ occur in $f_2(j)$, and $(p_1, j_1), (p_2, j_2) \in \mathcal{A}$, and either $p_1 < p_2 < p$ and $j_2 < j_1 < j$, or $p < p_1 < p_2$ and $j < j_2 < j_1$, or $p_1 < p_2 < p$ and $j < j_1 < j_2$, or $p < p_1 < p_2$ and $j_1 < j_2 < j$.
For example, if the non-bilexical production we would obtain is

$A[a, a'] \to (A[b, b']\ d\ a\ A[c, c'],\ A[c, c']\ e\ A[b, b']\ a')$

then the bilexical transduction grammar that our algorithm produces contains:

$A[a, a'] \to (A[a, a']\ A[c, c'],\ A[c, c']\ A[a, a'])$
$A[a, a'] \to (A[a, a'],\ e\ A[a, a'])$
$A[a, a'] \to (A[b, b']\ A[a, a'],\ A[b, b']\ A[a, a'])$
$A[a, a'] \to (d\ A[a, a'],\ A[a, a'])$
$A[a, a'] \to (a, a')$
3 Reordering as postprocessing
In the following section we will discuss an algo-
rithm that was devised for context-free grammars.
To make it applicable to transduction, we propose
a way to represent bilexical transduction gram-
mars as ordinary context-free grammars. In the
new productions, symbols from the source and
target alphabets occur side by side, but whereas
source symbols are matched by the parser to the
input, the target symbols are gathered into output
strings. In our case, the unique output string the
parser eventually produces from an input string
is obtained from the most likely derivation that
matches that input string.
2That bilexical transduction grammars are less power-
ful than arbitrary context-free transduction grammars can
be shown formally; cf. Section 3.2.3 of (Aho and Ullman,
1972).
That the nonterminals in both halves of an RHS in the transduction grammar may occur in a different order is solved by introducing three special symbols, the reorder operators, which are interpreted after the parsing phase. These three operators will be written as "$[$", "$|$" and "$]$". In a given string, there should be matching triples of these operators, in such a way that if there are two such triples, then they either occur in two isolated substrings, or one occurs nested between the "$[$" and the "$|$" or nested between the "$|$" and the "$]$" of the other triple. The interpretation of an occurrence of a triple, say in an output string $v_1\,[\,v_2\,|\,v_3\,]\,v_4$, is that the two enclosed substrings should be reordered, so that we obtain $v_1 v_3 v_2 v_4$.
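This interpretation can be carried out in a single left-to-right pass with a stack; a minimal Python sketch, assuming the triples are well nested:

# Interpret matched reorder triples "[", "|", "]":
# v1 [ v2 | v3 ] v4 becomes v1 v3 v2 v4, with arbitrary nesting.

def reorder(tokens):
    stack = [[[]]]                 # each frame holds a list of segments
    for tok in tokens:
        if tok == "[":
            stack.append([[]])
        elif tok == "|":
            stack[-1].append([])
        elif tok == "]":
            first, second = stack.pop()
            stack[-1][-1].extend(second + first)   # swap the two segments
        else:
            stack[-1][-1].append(tok)
    return stack[0][0]

assert reorder("[ me plaît | il ]".split()) == ["il", "me", "plaît"]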
Both the reorder operators and the symbols of the target alphabet will here be marked by a horizontal line to distinguish them from the source alphabet. For example, the two productions

$A[a, a'] \to (A[a, a']\ A[c, c'],\ A[c, c']\ A[a, a'])$
$A[a, a'] \to (a, a')$

from the transduction grammar are represented by the following two context-free productions:

$A[a, a'] \to \overline{[}\ A[a, a']\ \overline{|}\ A[c, c']\ \overline{]}$
$A[a, a'] \to a\ \overline{a'}$
In the first production, the RHS nonterminals occur in the same order as in the left half of the original production, but reorder operators have been added to indicate that, after parsing, some substrings of the output string are to be reordered. Our reorder operators are similar to the two reordering operators of (Vilar and others, 1999), but the former are more powerful, since the latter allow only single words to be moved instead of whole phrases.
4 Finite-state approximation
There are several methods to approximate
context-free grammars by regular languages
(Nederhof, 2000). We will consider here only the so-called RTN method, which is applied in a simplified form.3
3As opposed to (Nederhof, 2000), we assume here that
all nonterminals are mutually recursive, and the grammar
contains self-embedding. We have observed that typical
grammars that we obtain in the context of this article indeed
have the property that almost all nonterminals belong to the
same mutually recursive set.
A finite automaton is constructed as follows. For each nonterminal $A$ from the grammar we introduce two states $q_A$ and $q'_A$. For each production $A \to X_1 \cdots X_m$ we introduce $m + 1$ states $q_0, \ldots, q_m$, and we add epsilon transitions from $q_A$ to $q_0$ and from $q_m$ to $q'_A$. The initial state of the automaton is $q_S$ and the only final state is $q'_S$, where $S$ is the start symbol of the grammar.

If a symbol $X_i$ in the RHS of a production is a terminal, then we add a transition from $q_{i-1}$ to $q_i$ labelled by $X_i$. If a symbol $X_i$ in the RHS is a nonterminal $B$, then we add epsilon transitions from $q_{i-1}$ to $q_B$ and from $q'_B$ to $q_i$.
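The construction can be sketched in a few lines of Python (illustrative representation; the resulting transitions would then be fed to determinization and minimization, as described next):

# Simplified RTN approximation: one state pair per nonterminal, one
# state chain per production, epsilon transitions (label None) gluing
# nonterminal occurrences to their state pairs.
from itertools import count

def rtn_approximation(grammar, start):
    """grammar: dict nonterminal -> list of RHSs (lists of symbols).
    Returns (transitions, initial state, final state)."""
    transitions, fresh = [], count()
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            chain = [next(fresh) for _ in range(len(rhs) + 1)]
            transitions.append((("in", lhs), None, chain[0]))
            transitions.append((chain[-1], None, ("out", lhs)))
            for i, sym in enumerate(rhs):
                if sym in grammar:   # nonterminal: detour via its state pair
                    transitions.append((chain[i], None, ("in", sym)))
                    transitions.append((("out", sym), None, chain[i + 1]))
                else:                # terminal: labelled transition
                    transitions.append((chain[i], sym, chain[i + 1]))
    return transitions, ("in", start), ("out", start)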
The resulting automaton is determinized and
minimized to allow fast processing of input. Note
that if we apply the approximation to the type of
context-free grammar discussed in Section 3, the
transitions include symbols from both source and
target alphabets, but we treat both uniformly as in-
put symbols for the purpose of determinizing and
minimizing. This means that the driver for the
finite automaton still encounters nondeterminism
while processing an input string, since a state may
have several outgoing transitions for different out-
put symbols.
Furthermore, we ignore any weights that might
be attached to the context-free productions, since
determinization is problematic for weighted au-
tomata in general and in particular for the type
of automaton that we would obtain when carry-
ing over the weights from the context-free gram-
mar onto the approximating language following
(Mohri and Nederhof, 2001).
Instead, weights for the transitions of the fi-
nite automaton are obtained by training, using
strings that are produced as a side effect of the
computation of the grammar from the corpus.
These strings contain the symbols from both the
source and target strings mixed together, plus oc-
currences of the reorder operators where needed.
An English/French example might be:

$\overline{[}$ I $\overline{me}$ like $\overline{plaît}$ $\overline{|}$ him $\overline{il}$ $\overline{]}$
The way these strings were obtained ensures that
they are included in the language generated by
the context-free grammar, and they are therefore
also accepted by the approximating automaton
due to properties of the RTN approximation. The
weights are the negative log of the probabilities
obtained by maximum likelihood estimation.
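The text does not spell out the estimation procedure beyond this; one straightforward reading, sketched below, is to run each training string through the determinized automaton, count how often each transition is traversed, and normalize per state.

# Sketch: maximum likelihood weights for a deterministic automaton.
import math
from collections import defaultdict

def train_weights(delta, initial, strings):
    """delta: dict (state, symbol) -> next state; strings are accepted."""
    counts = defaultdict(int)
    for s in strings:
        state = initial
        for sym in s:
            counts[(state, sym)] += 1
            state = delta[(state, sym)]
    totals = defaultdict(int)
    for (state, sym), c in counts.items():
        totals[state] += c
    # weight = negative log of the per-state relative frequency
    return {k: -math.log(c / totals[k[0]]) for k, c in counts.items()}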
5 Robustness
The approximating finite automaton cannot ensure that the reorder operators "$[$", "$|$" and "$]$" occur in matching triples in output strings. There are two possible ways to deal with this problem. First, we could extend the driver of the finite automaton to only consider derivations in which the operators are matched. This is however counter to our need for very efficient processing, since we are not aware of any practical algorithm of less than cubic complexity for finding matching brackets in paths in a graph.
Therefore, we have chosen a second approach, viz. to make the postprocessing robust, by inserting missing occurrences of "$[$" or "$]$" and removing redundant occurrences of brackets. This means that any string containing symbols from the target alphabet and occurrences of the reorder operators is turned into a string without reorder operators, with a change of word order where necessary.
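One such repair policy can be sketched as follows; the text does not fix the exact policy, so this is just one reasonable variant, which drops redundant "|" and "]" and supplies missing "|" and "]" at the end.

# Robust interpretation of possibly unmatched reorder operators.

def robust_reorder(tokens):
    stack = [[[]]]
    for tok in tokens:
        if tok == "[":
            stack.append([[]])
        elif tok == "|":
            if len(stack) > 1 and len(stack[-1]) == 1:
                stack[-1].append([])     # else: redundant "|", dropped
        elif tok == "]":
            if len(stack) > 1:           # else: unmatched "]", dropped
                segs = stack.pop() + [[]]    # pad in case "|" was missing
                stack[-1][-1].extend(segs[1] + segs[0])
        else:
            stack[-1][-1].append(tok)
    while len(stack) > 1:                # close unmatched "["
        segs = stack.pop() + [[]]
        stack[-1][-1].extend(segs[1] + segs[0])
    return stack[0][0]

assert robust_reorder("[ me plaît him il".split()) == ["me", "plaît", "him", "il"]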
Both the transduction grammar and, to a lesser
extent, the approximating finite automaton suffer
from not being able to handle all strings of sym-
bols from the source alphabet. With finite-state
processing however, it is rather easy to obtain ro-
bustness, by making the following three provi-
sions:
1. To the nondeterministic finite automaton we add one epsilon transition from the initial state to $q_A$, for each nonterminal $A$. This means that from the initial state we may recognize an arbitrary phrase generated by some nonterminal from the grammar.
2. After the training phase of the weighted
(minimal deterministic) automaton, all tran-
sitions that have not been visited obtain a
fixed high (but finite) weight. This means
that such transitions are only applied if all
others fail.
3. The driver of the automaton is changed so that it restarts at the initial state when it gets stuck at some input word, and when necessary, that input word is deleted; a rough sketch follows this list. The output string with the lowest weight obtained so far (preferably attached to final states, or else to other states with outgoing transitions labelled by input symbols) is then concatenated with the output string resulting from processing subsequent input.
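The third provision amounts to an outer loop around the decoder; the rough Python sketch below conveys the idea (it is greedy and ignores weights, whereas the actual driver keeps the lowest-weight output as described above, so this is only an approximation of the intended behaviour).

# Rough sketch of the robust driver: restart at the initial state
# when stuck; delete the input word if it cannot be consumed at all.
# step(state, word) returns (next state, emitted output) or None.

def robust_drive(step, initial, words):
    state, output, i = initial, [], 0
    while i < len(words):
        result = step(state, words[i])
        if result is None:
            if state == initial:
                i += 1               # word untranslatable even after restart
            else:
                state = initial      # restart, keeping the output so far
        else:
            state, emitted = result
            output.extend(emitted)
            i += 1
    return output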
6 Experiments
We have investigated a corpus of En-
glish/Japanese sentence pairs, related by
hierarchical alignment (see also (Bangalore and
Riccardi, 2001)). We have taken the first 500,
1000, 1500, . . . aligned sentence pairs from this
corpus to act as training corpora of varying sizes;
we have taken 300 other sentence pairs to act as
test corpus.
We have constructed a bilexical transduction
grammar from each training corpus, in the form
of a context-free grammar, and this grammar was
approximated by a finite automaton. The input
sentences from the test corpus were then pro-
cessed by context-free and finite-state machin-
ery (in the sequel referred to by cfg and fa, re-
spectively). We have also carried out experiments with robust finite-state processing, as discussed in Section 5, which is referred to by robust fa. If we append 2 after a tag, this means that $\mathit{nont}(p, p') = A[w_1/p, w_2/p']$; otherwise $\mathit{nont}(p, p') = A[w_1/p]$ (see Section 2.3).
The reorder operators from the resulting out-
put strings were applied in a robust way as ex-
plained in Section 5. The output strings were
then compared to the reference output from the
corpus, resulting in Figure 1. Our metric is word
accuracy, which is based on edit distance. For a
pair of strings, the edit distance is defined as the
minimum number of substitutions, insertions and
deletions needed to turn one string into the other.
The word accuracy of a string $w$ with regard to a string $v$ is defined to be $1 - d/n$, where $d$ is the edit distance between $w$ and $v$ and $n$ is the length of $v$.
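For concreteness, this is the standard dynamic program for edit distance, applied at the word level:

# Word accuracy 1 - d/n, with d the edit distance between hypothesis
# w and reference v, and n the length of the reference v.

def word_accuracy(w, v):
    d = [[i + j if i * j == 0 else 0 for j in range(len(v) + 1)]
         for i in range(len(w) + 1)]
    for i in range(1, len(w) + 1):
        for j in range(1, len(v) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                    # deletion
                          d[i][j - 1] + 1,                    # insertion
                          d[i - 1][j - 1] + (w[i - 1] != v[j - 1]))
    return 1 - d[len(w)][len(v)] / len(v)

print(word_accuracy("il me plaît".split(), "il me plaît".split()))  # 1.0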
To allow a comparison with more established
techniques (see e.g. (Bangalore and Riccardi,
2001)), we also take into consideration a simple
bigram model, trained on the strings comprising
both source and target sentences and reorder oper-
ators, as explained in Section 4. For the purposes
of predicting output symbols, a series of consecutive target symbols and reorder operators following a source symbol in the training sentences is treated as a single symbol by the bigram model, and only those series may be output after that source
symbol. Since our construction is such that target
symbols always follow source symbols they are a
translation of (according to the automatically ob-
tained hierarchical alignment), this modification
to the bigram model prevents output of totally un-
related target symbols that could otherwise result
from a standard bigram model. It also ensures that
a bounded number of output symbols per input
symbol are produced.
The fraction of sentences that were transduced
(i.e. that were accepted by the grammar or the
automaton), is indicated in Figure 2. Since ro-
bust fa(2) and bigram are able to transduce all
input, they are not represented here. Note that the
average word accuracy is computed only with re-
spect to the sentences that could be transduced,
which explains the high accuracy for small train-
ing corpora in the cases of cfg(2) and fa(2),
where the few sentences that can be transduced
are mostly short and simple.
Figure 3 presents the time consumption of
transduction for the entire test corpus. These
data support our concerns about the high costs of
context-free processing, even though our parser
relies heavily on lexicalization.4
Figure 4 shows the sizes of the automata after determinization and minimization. Determinization for the largest automata indicated in the figure took more than 24 hours for both fa(2) and robust fa(2), which suggests that these methods become unrealistic for training corpus sizes considerably larger than 10,000 bitexts.
7 Conclusions
For our application, context-free transduction has
a relatively high accuracy, but it also has a high
time consumption, and it may be difficult to ob-
tain robustness without further increasing the time
costs. These are two major obstacles for use in
spoken language systems. We have tried to ob-
tain a rational transduction that approximates a
context-free transduction, preserving some of its accuracy.

4It uses a trie to represent productions (similar to ELR parsing (Nederhof, 1994)), postponing generation of output for a production until all nonterminals and all input symbols from the right-hand side have been found.

Figure 1: Average word accuracy for transduced sentences. [Plot of word accuracy against training corpus size (0 to 8000) for cfg2, cfg, fa2, fa, bigram, robust_fa2 and robust_fa.]

Figure 2: Fraction of the sentences that were transduced. [Plot of the accepted fraction against training corpus size (0 to 8000) for fa, fa2, cfg and cfg2.]
Our experiments show that the automata we ob-
tain become very large for training corpora of in-
creasing sizes. This poses a problem for deter-
minization. We conjecture that the main source of
the excessive growth of the automata lies in noise
in the bitexts and their hierarchical alignments. It
is a subject for further study whether we can re-
duce the impact of this noise, e.g. by clustering of
source symbols, or by removing some infrequent,
idiosyncratic rules from the obtained transduction
grammar. Also, other methods of regular approx-
imation of context-free grammars may be consid-
ered.
In comparison to a simpler model, viz. bi-
grams, our approximating transductions do not
have a very high accuracy, which is especially
worrying since the off-line costs of computation
are much higher than in the case of bigrams. The
relatively low accuracy may be due to sparseness of data when attaching weights to transitions: the size of the minimal deterministic automaton grows much faster than the size of the training corpus it is constructed from, and the same training corpus is used to train the weights of the transitions of the automaton. As a result, many transitions do not obtain accurate weights, and unseen input sentences are not translated accurately.
The problems described here may be avoided
by leaving out the determinization of the automa-
ton. This however leads to two new problems:
training of the weights requires more sophisti-
cated algorithms, and we may expect an increase
in the time needed to transduce input sentences,
since now both source and target symbols give
rise to nondeterminism. Whether these problems can be overcome requires further study.

Figure 3: Time consumption of transduction. [Plot of time (msec) against training corpus size (0 to 8000) for cfg2, cfg, robust_fa2, robust_fa, fa2 and fa.]
Acknowledgements
This work is a continuation of partly unpub-
lished experiments by Srinivas Bangalore, which
includes regular approximation of grammars ob-
tained from hierarchical alignments. Many ideas
in this paper originate from frequent discus-
sions with Hiyan Alshawi, Srinivas Bangalore
and Mehryar Mohri, for which I am very grate-
ful.

References
A.V. Aho and J.D. Ullman. 1972. Parsing, volume 1
of The Theory of Parsing, Translation and Compil-
ing. Prentice-Hall.
H. Alshawi, S. Bangalore, and S. Douglas. 2000.
Learning dependency translation models as collec-
tions of finite-state head transducers. Computa-
tional Linguistics, 26(1):45–60.
S. Bangalore and G. Riccardi. 2001. A finite-state
approach to machine translation. In 2nd Meeting of
the North American Chapter of the ACL, Pittsburgh,
PA, June.
P.F. Brown et al. 1990. A statistical approach to
machine translation. Computational Linguistics,
16(2):79–85.
J. Eisner and G. Satta. 1999. Efficient parsing for
bilexical context-free grammars and head automa-
ton grammars. In 37th Annual Meeting of the ACL,
pages 457–464, Maryland, June.
P.M. Lewis II and R.E. Stearns. 1968. Syntax-directed
transduction. Journal of the ACM, 15(3):465–488.
M. Mohri and M.-J. Nederhof. 2001. Regular approx-
imation of context-free grammars through transfor-
mation. In J.-C. Junqua and G. van Noord, editors,
Robustness in Language and Speech Technology,
pages 153–163. Kluwer Academic Publishers.
M.-J. Nederhof. 1994. An optimal tabular parsing al-
gorithm. In 32nd Annual Meeting of the ACL, pages
117–124, Las Cruces, New Mexico, June.
M.-J. Nederhof. 2000. Practical experiments with
regular approximation of context-free languages.
Computational Linguistics, 26(1):17–44.
C. Tillmann and H. Ney. 2000. Word re-ordering and
DP-based search in statistical machine translation.
In The 18th International Conference on Compu-
tational Linguistics, pages 850–856, Saarbrücken,
July–August.
J.M. Vilar et al. 1999. Text and speech translation by
means of subsequential transducers. In A. Kornai,
editor, Extended finite state models of language,
pages 121–139. Cambridge University Press.
H. Watanabe, S. Kurohashi, and E. Aramaki. 2000.
Finding structural correspondences from bilingual
parsed corpus for corpus-based translation. In The
18th International Conference on Computational
Linguistics, pages 906–912, Saarbrücken, July–
August.
K. Yamamoto and Y. Matsumoto. 2000. Acquisition
of phrase-level bilingual correspondence using de-
pendency structure. In The 18th International Con-
ference on Computational Linguistics, pages 933–
939, Saarbrücken, July–August.
