Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 95–98,
Ann Arbor, June 2005. c©Association for Computational Linguistics, 2005
Experiments Using MAR for Aligning Corpora∗
Juan Miguel Vilar
Departamento de Lenguajes y Sistemas Inform´aticos
Universitat Jaume I
Castell´on (Spain)
jvilar@lsi.uji.es
Abstract
We present some experiments conducted
within the context of one of the shared
tasks of the ACL 2005 Workshop on
Building and Using Parallel Texts. We
have employed a new model for finding
the alignments. This new model takes
a recursive approach in order to find the
alignments. As its computational costs are
quite high, a method for splitting the train-
ing sentences in smaller parts is used.
1 Introduction
We present the experiments we conducted within the
context of the shared task of the track on building
and using parallel texts for languages with scarce
resources of the ACL 2005 Workshop on Build-
ing and Using Parallel Texts. The aim of the task
was to align the words of sentence pairs in differ-
ent language pairs. We have participated using the
Romanian-English corpora.
We have used a new model, the MAR (from the
Spanish initials of Recursive Alignment Model) that
allowed us to find structured alignments that were
later transformed in a more conventional format.
The basic idea of the model is that the translation of
a sentence can be obtained in three steps: first, the
sentence is divided in two parts; second, each part
is translated separately using the same process; and
∗Work partially supported by Bancaixa through the project
“Sistemas Inductivos, Estad´ısticos y Estructurales, para la Tra-
ducci´on Autom´atica (SIEsTA)”.
third, the two translations are joined. The high com-
putational costs associated with the training of the
model made it necessary to split the training pairs in
smaller parts using a simple heuristic.
Initial work with this model can be seen in (Vi-
lar Torres, 1998). A detailed presentation can be
found in (Vilar and Vidal, 2005). This model shares
some similarities with the stochastic inversion trans-
duction grammars (SITG) presented by Wu in (Wu,
1997). The main point in common is the num-
ber of possible alignments between the two models.
On the other hand, the parametrizations of SITGs
and the MAR are completely different. The gen-
erative process of SITGs produces simultaneously
the input and output sentences and the parameters
of the model refer to the rules of the nontermi-
nals. This gives a clear symmetry to both input
and output sentences. Our model clearly distin-
guishes an input and output sentence and the pa-
rameters are based on observable properties of the
sentences (their lengths and the words composing
them). Also, the idea of splitting the sentences un-
til a simple structure is found in the Divisive Clus-
tering presented in (Deng et al., 2004). Again, the
main difference is in the probabilistic modeling of
the alignments. In Divisive Clustering a uniform dis-
tribution on the alignments is assumed while MAR
uses a explicit parametrization.
The rest of the paper is structured as follows: the
next section gives an overview of the MAR, then we
explain the task and how the corpora were split, after
that, how the alignments were obtained is explained,
finally the results and conclusions are presented.
95
2 The MAR
We provide here a brief description of the model,
a more detailed presentation can be found in (Vilar
and Vidal, 2005). The idea is that the translation of
a sentence ¯x into a sentence ¯y can be performed in
the following steps1:
(a) If ¯x is small enough, IBM’s model 1 (Brown et
al., 1993) is employed for the translation.
(b) If not, a cut point is selected in ¯x yielding two
parts that are independently translated applying
the same procedure recursively.
(c) The two translations are concatenated either in
the same order that they were produced or sec-
ond first.
2.1 Model parameters
Apart from the parameters of model 1 (a stochas-
tic dictionary and a discrete distribution of lenghts),
each of the steps above defines a set of parameters.
We will consider now each set in turn.
Deciding the submodel The first decision is
whether to use IBM’s model 1 or to apply the MAR
recursively. This decision is taken on account of the
length of ¯x. A table is used so that:
Pr(IBM | ¯x) ≈ MI(|¯x|),
Pr(MAR | ¯x) ≈ MM(|¯x|).
Clearly, for every ¯x we have that Pr(IBM | ¯x) +
Pr(MAR | ¯x) = 1.
Deciding the cut point It is assumed that the
probability of cutting the input sentence at a given
position b is most influenced by the words around it:
xb and xb+1. We use a table B such that:
Pr(b | ¯x) ≈ B(xb,xb+1)summationtext|¯x|−1
i=1 B(xi,xi+1)
.
That is, a weight is assigned to each pair of words
and they are normalized in order to obtaing a proper
probability distribution.
1We use the following notational conventions. A string or
sequence of words is indicated by a bar like in ¯x, individual
words from the sequence carry a subindex and no bar like in xi,
substrings are indicated with the first and last position like in ¯xji .
Finally, when the final position of the substring is also the last
of the string, a dot is used like in ¯x.i
Deciding the concatenation direction The direc-
tion of the concatenation is also decided as a func-
tion of the two words adjacent to the cut point, that
is:
Pr(D | b, ¯x) ≈ DD(xb,xb+1),
Pr(I | b, ¯x) ≈ DI(xb,xb+1),
where D stands for direct concatenation (i.e.
the translation of ¯xb1 will precede the transla-
tion of ¯x.b+1) and I stands for inverse. Clearly,
DD(xb,xb+1) + DI(xb,xb+1) = 1 for every
pair (xb,xb+1).
2.2 Final form of the model
With these parameters, the final model is:
pT(¯y | ¯x) =
MI(|¯x|)pI(¯y | ¯x)
+MM(|¯x|)
|¯x|−1summationdisplay
b=1
B(xb,xb+1)summationtext
|¯x|−1
i=1 B(xi,xi+1)
·
parenleftBigg
DD(xb,xb+1)
|¯y|−1summationdisplay
c=1
pT(¯yc1 | ¯xb1)pT(¯y.c+1 | ¯x.b+1)
+DI(xb,xb+1)
|¯y|−1summationdisplay
c=1
pT(¯y.c+1 | ¯xb1)pT(¯yc1 | ¯x.b+1)
parenrightBigg
were pI represents the probability assigned by
model 1 to a pair of sentences.
2.3 Model training
The training of the model parameters is done max-
imizing the likelihood of the training sample. For
each training pair (¯x, ¯y) and each parameter P rele-
vant to it, the value of
C(P) = Pp
T(¯y | ¯x)
∂ pT(¯y | ¯x)
∂ P (1)
is computed. This corresponds to the counts of P
in that pair. As the model is polynomial on all
its parameters except for the cuts (the B’s), Baum-
Eagon’s inequality (Baum and Eagon, 1967) guar-
antees that normalization of the counts increases the
likelihood of the sample. For the cuts, Gopalakr-
ishnan’s inequality (Gopalakrishnan et al., 1991) is
used.
96
Table 1: Statistics of the training corpus. Vocabulary
refers to the number of different words.
Language Sentences Words Vocabulary
Romanian 48 481 976 429 48 503
English 48 481 1 029 507 27 053
The initial values for the dictionary are trained
using model 1 training and then a series of itera-
tions are made updating the values of every param-
eter. Some additional considerations are taken into
account for efficiency reasons, see (Vilar and Vidal,
2005) for details.
A potential problem here is the large number of
parameters associated with cuts and directions: two
for each possible pair of words. But, as we are in-
terested only in aligning the corpus, no provision is
made for the data sparseness problem.
3 The task
The aim of the task was to align a set of 200 transla-
tion pairs between Romanian and English. As train-
ing material, the text of 1984, the Romanian Con-
stitution and a collection of texts from the Web were
provided. Some details about this corpus can be seen
in Table 1.
4 Splitting the corpus
To reduce the high computational costs of training of
the parameters of MAR, a heuristic was employed in
order to split long sentences into smaller parts with
a length less than l words.
Suppose we are to split sentences ¯x and ¯y. We
begin by aligning each word in ¯y to a word in ¯x.
Then, a score and a translation is assigned to each
substring ¯xji with a length below l. The translation is
produced by looking for the substring of ¯y which has
a length below l and which has the largest number
of words aligned to positions between i and j. The
pair so obtained is given a score equal to sum of: (a)
the square of the length of ¯xji; (b) the square of the
number of words in the output aligned to the input;
and (c) minus ten times the sum of the square of the
number of words aligned to a nonempty position out
of ¯xji and the number of words outside the segment
chosen that are aligned to ¯xji.
These scores are chosen with the aim of reduc-
ing the number of segments and making them as
“complete” as possible, ie, the words they cover are
aligned to as many words as possible.
After the segments of ¯x are so scored, the partition
of ¯x that maximizes the sum of scores is computed
by dynamic programming.
The training material was split in parts up to ten
words in length. For this, an alignment was obtained
by training an IBM model 4 using GIZA++ (Och and
Ney, 2003). The test pairs were split in parts up to
twenty words. After the split, there were 141945
training pairs and 337 test pairs. Information was
stored about the partition in order to be able to re-
cover the correct alignments later.
5 Aligning the corpus
The parameters of the MAR were trained as ex-
plained above: first ten IBM model 1 iterations were
used for giving initial values to the dictionary proba-
bilities and then ten more iterations for retraining the
dictionary together with the rest of the parameters.
The alignment of a sentence pair has the form of a
tree similar to those in Figure 1. Each interior node
has two children corresponding to the translation of
the two parts in which the input sentence is divided.
The leaves of the tree correspond to those segments
that were translated by model 1.
As the reference alignments do not have this kind
of structure it is necessary to “flatten” them. The
procedure we have employed is very simple: if we
are in a leaf, every output word is aligned to every
input word; if we are in an interior node, the “flat”
alignments for the children are built and then com-
bined. Note that the way leaves are labeled tends to
favor recall over precision.
The flat alignment corresponding to the trees of
Figure 1 are:
economia si finantele publice
economy and public finance
and
Winston se intoarse brusc .
Winston turned round abruptly .
97
economia si finantele publice
economy and public finance
economia si
economy and
finantele publice
public finance
economia
economy
si
and
finantele
finance
publice
public
Winston se intoarse brusc .
Winston turned round abruptly .
Winston se intoarse
Winston turned round
brusc .
abruptly .
Winston
Winston
se intoarse
turned round
brusc
abruptly
.
.
Figure 1: Two trees representing the alignment of two pair of sentences.
Precision Recall F-Measure AER
0.5404 0.6465 0.5887 0.4113
Table 2: Results for the task
6 Results and discussion
The results for the alignment can be seen in Ta-
ble 2. As mentioned above, there is a certain prefer-
ence for recall over precision. For comparison, us-
ing GIZA++ on the split corpus yields a precision
of 0.6834 and a recall of 0.5601 for a total AER
of 0.3844.
Note that although the definition of the task al-
lowed to mark the alignment as either probable or
sure, we marked all the alignments as sure, so pre-
cision and recall measures are given only for sure
alignments.
There are aspects that deserve further experimen-
tation. The first is the split of the original corpus.
It would be important to evaluate its influence, and
to try to find methods of using MAR without any
split at all. A second aspect of great importance is
the method used for “flattening”. The way leaves
of the tree are treated probably could be improved
if the dictionary probabilities were somehow taken
into account.
7 Conclusions
We have presented the experiments done using a
new translation model for finding word alignments
in parallel corpora. Also, a method for splitting the
input before training the models has been presented.

References
Leonard E. Baum and J. A. Eagon. 1967. An inequal-
ity with applications to statistical estimation for prob-
abilistic functions of Markov processes and to a model
for ecology. Bulletin of the American Mathematical
Society, 73:360–363.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and Robert L. Mercer. 1993. The mathemat-
ics of statistical machine translation: Parameter esti-
mation. Computational Linguistics, 19(2):263–311,
June.
Yonggang Deng, Shankar Kumar, and William Byrne.
2004. Bitext chunk alignment for statistical machine
translation. Research Note 50, CLSP Johns Hopkins
University, April.
P. S. Gopalakrishnan, Dimitri Kanevsky, Arthur N´adas,
and David Nahamoo. 1991. An inequality for ra-
tional functions with applications to some statistical
problems. IEEE Transactions on Information Theory,
37(1):107–113, January.
Franz Josef Och and Hermann Ney. 2003. A system-
atic comparison of various statistical alignment mod-
els. Computational Linguistics, 29(1):19–51.
Juan Miguel Vilar and Enrique Vidal. 2005. A recursive
statistical translation model. In Workshop on Build-
ing and Using Parallel Texts, Ann-Arbour (Michigan),
June.
Juan Miguel Vilar Torres. 1998. Aprendizaje de Tra-
ductores Subsecuenciales para su empleo en tareas
de dominio restringido. Ph.D. thesis, Departamento
de Sistemas Inform´aticos y Computaci´on, Universidad
Polit´ecnica de Valencia, Valencia (Spain). (in Span-
ish).
Dekai Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora.
Computational Linguistics, 23(3):377–403.
