Machine Translation with Grammar Association:
Some Improvements and the Loco C Model
Federico Prat
Departamento de Lenguajes y Sistemas Inform´aticos
Universitat Jaume I de Castell´o
E-12071 Castell´on de la Plana, Spain
fprat@lsi.uji.es
Abstract
Grammar Association is a technique for
Machine Translation and Language Un-
derstanding introduced in 1993 by Vi-
dal, Pieraccini and Levin. All the sta-
tistical and structural models involved
in the translation process are automat-
ically built from bilingual examples,
and the optimal translation of new sen-
tences can be efficiently found by Dy-
namic Programming algorithms. This
paper presents and discusses Grammar
Association state of the art, including a
new statistical model: Loco C.
1 Introduction
Grammar Association is a promising technique
for facing Machine Translation and Language
Understanding tasks,1 first proposed by Vidal,
Pieraccini, and Levin (1993). This technique
combines statistical and structural models, all of
which can be automatically built from a set of
bilingual sentence pairs. Moreover, the optimal
translation of new input sentences can be effi-
ciently found by Dynamic Programming algo-
rithms.
Basically, a Grammar Association system con-
sists of three models: (1) an input grammar mod-
elling the input language of the translation task;
(2) an output grammar modelling its output lan-
guage; (3) an association model describing how
the use of certain elements (rules) of the input
1We view Language Understanding as a particular case
of Machine Translation where the output language is aimed
at representing the meaning of input sentences.
grammar is related (in the translation task) to the
use of their corresponding elements in the output
grammar. Using these models, the system per-
forms the translation of input sentences as fol-
lows: (1) first, the input sentence is parsed using
the input grammar, giving rise to an input deriva-
tion; (2) given the input derivation, the associa-
tion model assigns a weight to each rule of the
output grammar; (3) in the (now weighted) output
grammar, a search for the optimal output deriva-
tion is carried out; (4) the sentence associated to
that derivation is conjectured as translation of the
input sentence.
We are interested in designing Machine Trans-
lation systems based on the principles of Gram-
mar Association and within a statistical frame-
work. Some steps we have taken towards this final
end are presented in this work.
2 Grammar Association into a statistical
framework
In most of the papers describing statistical ap-
proaches to Machine Translation, Bayes’ rule is
applied giving rise to the following Fundamental
Equation,
a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a16a15a18a17a19a11a21a20
a22a24a23a14a25a27a26a29a28
a13a30a3a31a0a33a32a21a5a34a7
a9a12a11a14a13a16a15a18a17a19a11a21a20
a22a24a23a14a25 a26a29a28
a13a30a3a31a0a35a7a37a36
a28
a13a30a3a38a5a39a32a21a0a35a7a41a40
meaning that the optimal translation a0 a1 of an in-
put sentence a5 , the most probable sentence a0 in
the output language a42a4a43 given a5 a44 a42a10a45 , can be
found by maximizing the product of two factors:
a46 The a priori probability of the output sen-
tence,
a28
a13a30a3a31a0a47a7 . In practice, it is computed by
using a statistical model of the output lan-
guage a42a4a43 .
a46 The conditional probability
a28
a13a37a3a38a5a39a32a14a0a35a7 of
the input sentence a5 , given the output one
a0 . In practice, it is computed by using a sta-
tistical model of the reverse translation pro-
cess.
This decomposition has the advantage of modu-
larity in the modelling. An ad hoc statistical lan-
guage model encapsulates the features that are in-
herent to the output language, while the reverse
translation model can focus on relations between
input and output words, assigning scores to sen-
tence pairs without taking into account if the out-
put sentence is well-formed.2 An alternative, di-
rect statistical approach with a model for comput-
ing
a28
a13a48a3a49a0a50a32a21a5a34a7 seems to require this single model
to be complex enough to assign high scores only
to pairs where the output sentence verifies two
conditions: it is well-formed and means the same
that the input one. Hence, for the sake of sim-
plified modelling, Bayes’ decomposition has be-
come a typical choice in Machine Translation.
However, in the Grammar Association context,
when developing (using Bayes’ decomposition)
the basic equations of the system presented in (Vi-
dal et al., 1993), it is said that the reverse model
for
a28
a13a37a3a38a5a39a32a21a0a35a7 “does not seem to admit a sim-
ple factorization which is also correct and con-
venient”, so “crude heuristics” were adopted in
the mathematical development of the expression
to be maximized. We are going to show that, by
means of a direct modelling, Grammar Associa-
tion can be set into a rigorous statistical frame-
work without renouncing a convenient factoriza-
tion for the search of the optimal translation to be
efficient. Moreover, the main advantage of Bayes’
decomposition, modularity, is inherently present
in Grammar Association systems: relations be-
tween input and output are mainly modelled by
a (direct) statistical association model and struc-
tural features of the output language are modelled
by a grammar, which restricts the search space for
the best translation.
2Note that model behaviour for syntactically incorrect
input sentences is not important because input sentence is
known and the search is just over the output language.
Let us begin assuming there are unambiguous
grammars a51a52a45 and a51a53a43 describing, respectively, the
input language a42a4a45 and the output one a42a10a43 . Thus,
there is a one-to-one correspondence in each lan-
guage relating sentences to their derivations and
we can write
a28
a13a48a3a6a0a33a32a54a5a34a7a55a9
a28
a13a30a3a31a56a19a57
a26
a3a31a0a35a7a58a32a14a56a19a57a60a59a61a3a31a5a34a7a62a7a41a40
where a56 a57 a3a49a63a10a7 denotes the only derivation of sen-
tence a63 in grammar a51 . Moreover, let us suppose
the output grammar is context-free and rewrit-
ing probability of an output non-terminal using a
certain rule is independent of which other output
rules have been employed in the output deriva-
tion. Then, it follows that the probability of an
output derivation a56 a43 given an input one a56 a45 can
be expressed as
a28
a13a48a3a49a56
a43
a32a64a56
a45
a7a4a9 a65
a66
a26 a23a14a67 a26
a28
a13a48a3a6a68
a43
a32a70a69a72a71a6a73a75a74a76a3a38a68
a43
a7a77a40a78a56
a45
a7a41a40
with a term in the sum for each participation of a
rule a68 a43 in the derivation a56 a43 , and a69a79a71a31a73a80a74a76a3a38a68 a43 a7 denoting
the left-hand side non-terminal of that rule. So,
finally, we can find the most probable translation
a0 a1 a3a31a5a8a7 of an input sentence a5 as the sentence
associated to the output derivation given by
a11a14a13a78a15a81a17a19a11a21a20
a67a48a26a75a23a14a82a58a83
a57
a26a78a84
a65
a66
a26 a23a64a67 a26
a28
a13a48a3a31a68
a43
a32a70a69a79a71a31a73a80a74a60a3a38a68
a43
a7a77a40a78a56 a57 a59a27a3a38a5a8a7a16a7a41a40
where a85
a3
a51a53a43
a7 stands for the set of all possible
derivations in a51 a43 .
In practice, input and output grammars will be
approximations inferred from samples and, more
specifically, they will be acyclic finite-state au-
tomata. The restriction from context-free gram-
mars to regular ones is due to the wide availabil-
ity of inference techniques for these formal ma-
chines and to computational convenience. On the
other hand, the output grammar has to be acyclic
because of a more subtle point: the most prob-
able derivation in the grammar will never make
use of a cycle (no matter how high its probability
is, avoiding the cycle always makes the deriva-
tion more probable). Hence, if we allowed the in-
ference algorithm to model some features of the
output language using cycles, system translations
would never exhibit such features. Finally, for
the sake of homogeneity, we choose to force in-
put grammar to be acyclic too.
We can conclude this section saying that, in-
ferring deterministic and acyclic finite-state au-
tomata, if we are able to learn association models
for estimating, for each output rule, the probabil-
ity of using that rule conditioned on having em-
ployed its left-hand side and the identity of the
input derivation, then an efficient Dynamic Pro-
gramming search for the optimal output deriva-
tion3 can be used in order to provide the most
probable translation.
3 Using ECGI language models
The ECGI algorithm (Rulot and Vidal, 1987) is
a heuristic technique for the inference of acyclic
finite-state automata from positive samples, and
determinism can be imposed a posteriori by a
well-known transformation for regular grammars.
Therefore, in principle, ECGI provides exactly
the kind of language model Grammar Association
needs. Moreover, it was (without imposing deter-
minism) the inference technique employed in (Vi-
dal et al., 1993).
Informally, ECGI works as follows. With the
first sample sentence, it builds an initial automa-
ton consisting in a linear path representing the
sentence. Words label states (instead of arcs) and
there are two special non-labelled states: the ini-
tial one and the final one. For each new sentence,
if it is already recognized by the automaton built
so far, nothing happens; otherwise, if the current
model does not recognize the sentence, new arcs
and states are added to the most suitable path (ac-
cording to a minimum-cost criterion) for recogni-
tion to be possible. In a sense, it is like construct-
ing a new path for the new sentence and then find-
ing a maximal merge with a path in the automa-
ton.
For further discussion on some features of the
ECGI algorithm, let us first consider the following
set of five sentences: (1) "some snakes eat
rats"; (2) "some people eat snakes";
(3) "some people eat rats"; (4) "some
people are dangerous"; (5) "snakes
are dangerous". Figure 1 shows how ECGI
3Obviously, any algorithm for finding the minimum-cost
path in a graph is applicable.
BEGIN some ENDsnakes eat rats
(a) "some snakes eat rats"
BEGIN some END
snakes
eat
people
rats
snakes
(b) "some people eat snakes"
BEGIN some
ENDsnakes eat
people
rats
snakes
are dangerous
(c) "some people are dangerous"
BEGIN
some
snakes
END
snakes eat
people
rats
snakes
are dangerous
(d) "snakes are dangerous"
Figure 1: The ECGI algorithm: an example.
incrementally builds an automaton able to recog-
nize the whole training set and, moreover, per-
forms some generalizations. For instance, af-
ter considering the two first sentences (subfig-
ure b), two more sentences are also represented in
the current automaton: "some snakes eat
snakes" and "some people eat rats".
Thus, when this last sentence is actually presented
to the algorithm, there is no need for the automa-
ton to be updated. On the contrary, sentences 4
and 5 imply the addition of new elements and
the finally inferred automaton is the one shown
in subfigure d.
Though successful application of ECGI to a
variety of tasks has been reported,4 the method
4For instance, ECGI has been applied to problems as
different as speech understanding (Prieto and Vidal, 1992),
hand-written digit recognition (Vidal et al., 1995), and music
composition (Cruz and Vidal, 1997)
BEGIN
snakessome ENDeat
arepeople
rats
snakes
dangerous
Figure 2: An alternative automaton.
suffers from some drawbacks. For instance,
the level of generalization is sometimes lower
than expected. In the example presented in
Figure 1, when "snakes are dangerous"
is employed for updating the model in subfig-
ure c, instead of adding a new state and two
arcs to the path corresponding to "some peo-
ple are dangerous", the solution in Fig-
ure 2 seems to be an appealing alternative: adding
just two arcs, more reasonable generalization is
obtained. Nevertheless, ECGI chooses the solu-
tion in Figure 1 because it searches for just one
path to be modified with a minimal number of
new elements, and does not take into account
combinations of different paths.
On the other hand, ECGI can suffer from in-
adequate generalization, especially at early stages
of the incremental construction of the automa-
ton. If "some people eat snakes" and
"snakes are dangerous" were the first
two sentences presented to ECGI, the algorithm
would try to make use of the state "snakes"
of the initial model for representing the oc-
currence of that word in the second sentence,
leading to an automaton which would recognize
“sentences” as"some people eat snakes
are dangerous", or simply "snakes". The
situation that produces this kind of undesired be-
haviour of the method is characterized by the con-
fluence of a couple of circumstances: a word in a
new sentence is also present in the current model,
but with a different function, and that automaton
has not enough adequate structural information
for offering a better merging to the new sentence.
As pointed out by Prieto and Vidal (1992), a
proper ordering of the set of sentences presented
to ECGI can provide more compact models, and
we think that better ones too. The ordering we
propose here simply follows, first, a decreasing-
length criterion and then, for breaking ties, ap-
plies any dictionary-like ordering. Thus, we try
to avoid the problem discussed in the previous
paragraph by providing the inference algorithm
with as much as possible structural information at
first stages of automaton construction and, more-
over, dictionary-like ordering inside each length
is aimed at frequently presenting to ECGI new
sentences that are similar to the previous ones.
Furthermore, a very common way to reduce
the complexity of problems involving languages
is the definition of word categories, which can
be manually designed or automatically extracted
from data (Martin et al., 1995). We think catego-
rization helps in solving the problem of undesired
merges and also in increasing the generalization
abilities of ECGI. In order to illustrate this point,
let us consider a category a86 animalsa87 consisting
of words "snakes", "rats" and "people"
in the very simple example of Figure 1. Words
can be substituted for the appropriate category
in the original sentences; then, the modified sen-
tences are presented to the inference algorithm;
finally, categories in the automaton are expanded.
Figure 3 shows the automata that are successively
built in that process.
As said at the beginning of this section, deter-
minism must be imposed a posteriori for the lan-
guage models to fit our formal framework. In ad-
dition, we will apply them a minimization process
in order to simplify the problem that the corre-
sponding association model will have to solve.
4 Loco C: A new association model
Following a data-driven approach, a Grammar
Association system needs to learn from exam-
ples an association model capable to estimate the
probabilities required by our recently developed
framework, that is, the probability of each rule
in the grammar that models the output language,
conditioned on its left-hand side and the deriva-
tion of the input sentence.
Among the different association models we
have studied (Prat, 1998), it is worth emphasizing
one we have specifically developed for playing
that role in Grammar Association systems: the
Loco C model. We based our design on the IBM
models 1 and 2 (Brown et al., 1993), but taking
into account that our model must generate cor-
rect derivations in a given grammar, not any se-
BEGIN some END<animals> eat <animals>
(a) "some a88 animalsa89 eat a88 animalsa89 "
BEGIN some END<animals>
eat
are
<animals>
dangerous
(b) "some a88 animalsa89 are dangerous"
BEGIN <animals>
some
END
eat
are
<animals>
dangerous
(c) "a88 animalsa89 are dangerous"
BEGIN
snakes
rats
people
some
END
eat
are
snakes
rats
people
dangerous
(d) Expansion of a88 animalsa89
Figure 3: Using a category a86 animalsa87 for
"snakes", "rats" and "people" in the ex-
ample of Figure 1.
quence of rules.5 Moreover, we wanted to model
the probability estimation for each output rule
as an adequately weighted mixture,6 along with
keeping the maximum-likelihood re-estimation of
its parameters within the growth transformation
framework (Baum and Eagon, 1967; Gopalakr-
5In those simple IBM translation models, an output se-
quence (of words) is randomly generated from a given in-
put one by first choosing its length and then, for each posi-
tion in the output sequence, independently choosing an ele-
ment (word). If the relation between input and output deriva-
tions (sequences of rules) has to be explicitly modelled, the
choices of output elements can no longer be independent be-
cause a rule is only applicable if its left-hand side has just
appeared in the output derivation.
6In IBM models, all words in the input sequence have
the same influence in the random choice of output words
(model 1) or they have a relative influence depending on
their positions (model 2). In the case of derivations, we are
interested in modelling those relative influences taking into
account rule identities (instead of rule positions).
ishnan et al., 1991). After exploring some similar
alternatives (and discarding them because of their
poor results in a few translation experiments),
Loco C was finally defined as explained below.7
The Loco C model assumes a random gener-
ation process (of an output derivation, given an
input one) which begins with the starting symbol
of the output grammar as the “current sentential
form” and then, while the current sentential form
contains a non-terminal, iteratively performs the
following sequence of two random choices: in
Choice 1, one of the rules in the input derivation is
chosen; in Choice 2, the non-terminal in the cur-
rent sentential form is rewritten using a randomly
chosen rule of the output grammar.
The behaviour of the model depends on two
kinds of parameters, each one guiding one of the
choices mentioned above. Formally, given an in-
put derivation a56 a45 and an output non-terminal a90 a43
to be rewritten, the probability of an input rule
a68
a45
a44a91a56
a45 to be chosen in Choice 1 depends on pa-
rameters of the form a92 a3a38a68 a45 a32 a90a76a43 a7 and can be ex-
pressed as
a92
a3a38a68
a45
a32
a90 a43
a7
a93
a66a95a94
a59
a23a14a67
a59
a92a8a96
a68a14a97
a45
a32
a90 a43a99a98a2a100
On the other hand, once a particular input rule
a68
a45 is chosen, the probability of an output rule
a68
a43 whose left-hand side is a90a76a43 to be chosen in
Choice 2 is directly given by a parameter of the
form a101 a3a31a68 a43 a32a54a68 a45 a7 . Hence,
a28
a13a30a3a38a68
a43
a32a27a69a72a71a6a73a75a74a76a3a38a68
a43
a7a77a40a78a56
a45
a7
takes in Loco C the form
a93
a66
a59
a23a64a67
a59
a92
a3a38a68
a45
a32a70a69a72a71a6a73a75a74a76a3a38a68
a43
a7a62a7
a93
a66a95a94
a59
a23a14a67
a59
a92a34a96
a68a14a97
a45
a32a70a69a72a71a6a73a75a74a76a3a38a68
a43
a7
a98
a36
a101
a3a31a68
a43
a32a14a68
a45
a7
of a weighted mixture depending on two kinds of
trainable parameters:
a46
a92
a3a31a68
a45
a32
a90 a43
a7 : Measures the importance of a68
a45 in
choosing an adequate rewriting rule for a90 a43 .8
7Full details on the discarded models, Loco 1, Loco A,
and Loco B, can be found (in Spanish) in pages 52–60
of (Prat, 1998).
8Note that learning these parameters performs a sort of
“automatic variable selection” of the input rules that are rel-
evant for discriminatively choosing among the next applica-
ble output rules.
MLA Task
Spanish: "un c´ırculo oscuro est´a
encima de un c´ırculo"
English: "a dark circle is above a
circle"
Spanish: "se elimina el cuadrado os-
curo que est´a debajo del
c´ırculo y del tri´angulo"
English: "the dark square which is
below the circle and the
triangle is removed"
Simplified Tourist Task
Spanish: "nos vamos a ir el d´ıa diez
a la una de la tarde."
English: "we are leaving on the tenth
at one in the afternoon."
Spanish: "¿puedo pagar la cuenta con
dinero en efectivo?"
English: "can I pay the bill in
cash?"
Figure 4: Examples of sentence pairs.
a46
a101
a3a38a68
a43
a32a54a68
a45
a7 : Measures how much a68
a45 agrees in
using the rule a68 a43 .
Consequently, the corresponding likelihood func-
tion is not polynomial, but rational, so Baum-
Eagon inequality (1967) cannot be applied and
Gopalakrishnan et al. inequality (1991) must
be used, instead, in order to develop a Loco C
model re-estimation algorithm based on growth
transformations. Fortunately, both the computa-
tional complexity of the resulting re-estimation
algorithm (same order as with IBM model 1) and
the experimental results are satisfactory.
5 Experimental results
In a first series of experiments, we were interested
in knowing whether or not our proposals actually
improve Grammar Association state of the art. To
this end, a simple artificial Machine Translation
task was employed. The corpus consists of pairs
of sentences describing two-dimensional scenes
with circles, squares and triangles in Spanish and
English (some examples can be found in Figure 4,
where the task is referred to as MLA Task). There
are a102a64a103 words in the Spanish vocabulary and a102a64a104 in
Table 1: Results of an English-to-Spanish trans-
lation experiment with the original Grammar As-
sociation system, using a105a99a106 ,a106a64a106a64a106 pairs of the MLA
Task for training and a102a14a106a64a106 for testing.
Sentence Minimum Length Correct
Sorting Deterministic Constraint Translations
No No No a107a109a108a54a110a111a109a112
No No Yes a107a75a107a113a110a114a75a112
No Yes No a115a80a111a116a110a111a109a112
No Yes Yes a115a80a117a116a110a114a75a112
Yes No No a107a80a118a116a110a111a109a112
Yes No Yes a107a80a117a116a110a114a75a112
Yes Yes No a115a75a115a113a110a111a109a112
Yes Yes Yes a115a75a115a113a110a111a109a112
the English one.
Let us begin considering English-to-Spanish
translation, with a105a99a106 ,a106a64a106a64a106 pairs for training the sys-
tems and a102a14a106a64a106 different ones for testing purposes.
We carefully implemented the original Grammar
Association system described in (Vidal et al.,
1993), tuned empirically a couple of smoothing
parameters, trained the models and, finally, ob-
tained an a119a21a120
a100
a104a122a121 of correct translations.9 Then,
we studied the impact of: (1) sorting, as proposed
in Section 3, the set of sentences presented to
ECGI; (2) making language models deterministic
and minimum; (3) constraining the best transla-
tion search to those sentences whose lengths have
been seen, in the training set, related to the length
of the input sentence. As shown in Table 1, all
the proposed measures were beneficial and we got
a final a103a64a103
a100
a104a122a121 of correct translations (that is, just
one translation was wrong). Hence, we decided to
apply those measures to all our Grammar Asso-
ciation systems and, in particular, to our Loco C
one. This system, after tuning some minor param-
eters (for instance, the number of re-estimation it-
erations for the model was fixed to a104a14a106a64a106 ), got a
a103a64a103
a100
a106a61a121 of correct translations.
Then, in order to further compare our two sys-
tems (which will be referred to as IOGA, for Im-
proved Original Grammar Association, and sim-
ply Loco C) without more manual tuning, both
were tested with a105 ,a106a64a106a64a106 new sentence pairs: in
this case, IOGA got a a103a64a103
a100
a120a27a121 and Loco C got
9For each bilingual sentence pair
a123a72a124a53a125a31a126a128a127 employed for
testing a system, we consider that the system achieves a cor-
rect translation only if it produces exactly the sentence a126 as
output when it is provided with the sentence a124 as input.
a a103a64a103
a100
a103a122a121 .
In a second series of experiments, we wanted
to compare our best system, Loco C, with Re-
ConTra, the recurrent connectionist system de-
scribed in (Casta˜no and Casacuberta, 1997),
where a a103a64a119
a100
a120a27a121 of correct translations is reported
on the Spanish-to-English MLA Task with just
a129 ,
a106a64a106a64a106 pairs for training. In the same conditions,
Loco C got a a103a64a102
a100
a119a122a121 of correct translations on a
a105 ,a106a64a106a64a106 pair test set (IOGA, just an a119a130a105
a100a132a131
a121 ).
Since the MLA Task is an artificial task where
each language can be exactly modelled by an
acyclic finite-state automaton, we decided to use
those exact automata in our systems in order to
measure the impact of perfect language mod-
elling. In this case, Loco C reached perfect re-
sults (a105a99a106a64a106
a100
a106a61a121 ), while IOGA got a a103a64a104
a100
a106a61a121 . As a
conclusion to this second series of experiments,
we can point out that our systems are quite sensi-
tive to the quality of language models and, also,
that Loco C is a very good association model.
Our last series of experiments were carried out
on a different, more complex task (but artificial
too). It was extracted from the task defined for the
first phase of the EUTRANS project (Amengual et
al., 1996) and covers just a small subset of the
situations tourists can face when leaving hotels
(some examples can be found in Figure 4, where
the task is referred to as Simplified Tourist Task).
There are a105a113a133a14a119 words in the Spanish vocabulary
and a105a80a120a122a106 in the English one. We defined a stan-
dard scenario in which Spanish-to-English trans-
lation must be performed on a105 ,a106a64a106a64a106 sentences af-
ter training the corresponding models with a104 ,a106a64a106a64a106
pairs.
In that scenario, Loco C achieved an a119a14a106
a100
a119a122a121
of correct translations, where errors are mainly
due to lack of coverage in the language models,
especially in the input one: only a119a64a104
a100
a133a64a121 of the
Spanish sentences in the test set could be correctly
parsed with the inferred model, so we decided to
apply word categories to improve the generaliza-
tion capabilities of ECGI as exemplified in Sec-
tion 3. Using automatic categorization (Martin et
al., 1995) for extracting a133a14a104 Spanish word classes
and a104a14a106 English ones, the resulting language mod-
els achieved perfect coverage and the Loco C
system performance increased to a103a64a119
a100
a106a61a121 .
In order to put the previous figure into con-
text, it is worth saying that the best result obtained
by ReConTra in the same scenario was a103a130a105
a100
a105a113a121 .
On the other hand, combining automatic bilin-
gual categorization and Subsequential Transduc-
ers as described in (Barrachina and Vilar, 1999),
a a103a64a119
a100
a120a27a121 of correct translations can be achieved
for an adequate choice of the number of word
classes (
a131
a106 ), though only a
a131
a119
a100
a133a64a121 is obtained by
the same system in the absence of categorization.
6 Concluding remarks
Our work presents a set of improvements on pre-
vious state of the art of Grammar Association:
first, by providing better language models to the
original system described in (Vidal et al., 1993);
second, by setting the technique into a rigorous
statistical framework, clarifying which kind of
probabilities have to be estimated by association
models; third, by developing a novel and espe-
cially adequate association model: Loco C.
On the other hand, though experimental results
are quite good, we find them particularly relevant
for pointing out directions to follow for further
improvement of the Grammar Association tech-
nique. One of these directions consists in explor-
ing better language models, refining the catego-
rization methods employed in this work or substi-
tuting ECGI for some kind of merge-based infer-
ence algorithm (Thollard et al., 2000). Exploiting
data-driven bilingual categorization (Barrachina
and Vilar, 1999) is another promising way to im-
prove the performance of our system.
Finally, let us say that, obviously, the experi-
mental results on simple artificial tasks presented
in this work are not intended for convincing the
reader that our Grammar Association systems
could obtain similar performances on complex
tasks as, for instance, the Hansards (the bilin-
gual proceedings of the Canadian parliament).
Our controlled experiments were mainly aimed
at showing that our proposals improve Gram-
mar Association, along with comparing this tech-
nique with a couple of different ones and pro-
viding easy-to-analyse results. For these simple
purposes, we find our experimental work ade-
quate. However, natural translation tasks should
be faced soon, in the next stage of our research.
This implies, for instance, trying to cope with se-
vere data sparseness. In this regard, we are op-
timistic: on one hand, because we trust in bilin-
gual categorization for reducing the negative ef-
fects of sparseness (Vilar et al., 1995); on the
other hand, because some additional experiments
carried out with Grammar Association systems on
the Spanish-to-English MLA Task with just a104a14a106a64a106
pairs for training show acceptable results. For in-
stance, our Loco C achieved an a119a64a119
a100
a129
a121 of cor-
rect translations10 while, in the same scenario,
ReConTra performance drops to a104 a129
a100
a105a113a121 (Casta˜no
and Casacuberta, 1997).
Acknowledgements
Most of the work presented here was carried
out under the kind supervision of Dr. Francisco
Casacuberta, and the author want to express his
gratitude to him.
Furthermore, it is worth saying that this
work has been partially supported by grant
P1A99-10 from Fundaci´o Caixa Castell´o-
Bancaixa (NEUROTRAD project).

References
J. C. Amengual, J. M. Bened´ı, A. Casta˜no, A. Marzal,
F. Prat, E. Vidal, J. M. Vilar, C. Delogu, A. Di
Carlo, H. Ney, and S. Vogel. 1996. Definition of
a machine translation task and generation of cor-
pora. Technical Report D1, EuTrans (IT-LTR-OS-
20268).
S. Barrachina and J. M. Vilar. 1999. Bilingual cluster-
ing using monolingual algorithms. In Procs. of the
TMI’99, pages 77–87.
L. E. Baum and J. A. Eagon. 1967. An inequality with
applications to statistical estimation for probabilis-
tic functions of Markov processes and to a model
for ecology. Bulletin of the American Mathemati-
cal Society, 73:360–363.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Com-
putational Linguistics, 19(2):263–311.
M. A. Casta˜no and F. Casacuberta. 1997. A connec-
tionist approach to machine translation. In Procs.
of the EuroSpeech’97, volume 1, pages 91–94.
P. P. Cruz and E. Vidal. 1997. A study of grammatical
inference algorithms in automatic music composi-
tion. In Preprints of the VII National Symposium on
10In this experiment,
a134a62a117 Spanish word classes and a134a78a111 En-
glish ones were automatically extracted from the training
pairs in order to increase ECGI generalization capabilities.
Pattern Recognition and Image Analysis, volume 1,
pages 43–48.
P. S. Gopalakrishnan, D. Kanevsky, A. N´adas, and
D. Nahamoo. 1991. An inequality for ratio-
nal functions with applications to some statistical
problems. IEEE Trans. on Information Theory,
37(1):107–113.
S. Martin, J. Liermann, and H. Ney. 1995. Algorithms
for bigram and trigram word clustering. In Procs.
of the EuroSpeech’95.
F. Prat. 1998. Traducci´on autom´atica en domin-
ios restringidos: Algunos modelos estoc´asticos sus-
ceptibles de ser aprendidos a partir de ejemplos.
Ph.D. thesis, Depto. de Sistemas Inform´aticos y
Computaci´on, Universidad Polit´ecnica de Valencia
(Spain).
N. Prieto and E. Vidal. 1992. Learning language mod-
els through the ECGI method. Speech Communica-
tion, 11(2–3):299–309.
H. Rulot and E. Vidal. 1987. Modelling (sub)string-
length based constraints through a grammatical in-
ference method. In Pattern Recognition Theory and
Applications, volume F30 of NATO ASI, pages 451–
459. Springer-Verlag.
F. Thollard, P. Dupont, and C. de la Higuera. 2000.
Probabilistic DFA inference using Kullback-Leibler
divergence and minimality. In Procs. of the
ICML’2000, pages 975–982.
E. Vidal, R. Pieraccini, and E. Levin. 1993. Learning
associations between grammars: A new approach
to natural language understanding. In Procs. of the
EuroSpeech’93, pages 1187–1190.
E. Vidal, H. Rulot, J. M. Valiente, and G. An-
dreu. 1995. Application of the error-correcting
grammatical inference algorithm (ECGI) to planar
shape recognition. Technical Report DSIC-II/2/95,
Depto. de Sistemas Inform´aticos y Computaci´on,
Universidad Polit´ecnica de Valencia (Spain).
J. M. Vilar, A. Marzal, and E. Vidal. 1995. Learning
language translation in limited domains using finite-
state models: Some extensions and improvements.
In Procs. of the EuroSpeech’95, pages 1231–1234.
