<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0815"> <Title>Experiments Using MAR for Aligning Corpora*</Title> <Section position="2" start_page="0" end_page="95" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We present the experiments we conducted for the shared task of the track on languages with scarce resources at the ACL 2005 Workshop on Building and Using Parallel Texts. The aim of the task was to align the words of sentence pairs in different language pairs. We participated using the Romanian-English corpora.</Paragraph> <Paragraph position="1"> We used a new model, the MAR (from the Spanish initials of Recursive Alignment Model), which allowed us to find structured alignments that were later transformed into a more conventional format.</Paragraph> <Paragraph position="2"> The basic idea of the model is that the translation of a sentence can be obtained in three steps: first, the sentence is divided into two parts; second, each part is translated separately using the same process; and third, the two translations are joined. The high computational cost of training the model made it necessary to split the training pairs into smaller parts using a simple heuristic.</Paragraph> <Paragraph position="3"> * Work partially supported by Bancaixa through the project &quot;Sistemas Inductivos, Estadísticos y Estructurales, para la Traducción Automática (SIEsTA)&quot;.</Paragraph> <Paragraph position="4"> Initial work with this model can be seen in (Vilar Torres, 1998). A detailed presentation can be found in (Vilar and Vidal, 2005). This model shares some similarities with the stochastic inversion transduction grammars (SITG) presented in (Wu, 1997). 
The main point the two models have in common is the number of possible alignments.</Paragraph> <Paragraph position="5"> On the other hand, the parametrizations of SITGs and the MAR are completely different. The generative process of SITGs produces the input and output sentences simultaneously, and the parameters of the model refer to the rules of the nonterminals. This makes the input and output sentences clearly symmetric. Our model, in contrast, clearly distinguishes an input and an output sentence, and its parameters are based on observable properties of the sentences (their lengths and the words composing them). The idea of splitting the sentences until a simple structure is reached is also found in the Divisive Clustering presented in (Deng et al., 2004). Again, the main difference is in the probabilistic modeling of the alignments: Divisive Clustering assumes a uniform distribution over the alignments, while the MAR uses an explicit parametrization.</Paragraph> <Paragraph position="6"> The rest of the paper is structured as follows: the next section gives an overview of the MAR; then we explain the task and how the corpora were split; after that, we explain how the alignments were obtained; finally, we present the results and conclusions.</Paragraph> </Section> </Paper>