<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3107">
  <Title>Searching for alignments in SMT. A novel approach based on an Estimation of Distribution Algorithm [?]</Title>
  <Section position="3" start_page="47" end_page="47" type="metho">
    <SectionTitle>
2 Word Alignments in Statistical Machine Translation
</SectionTitle>
    <Paragraph position="0"> In statistical machine translation, a word alignment between two sentences (a source sentence f and a target sentence e) defines a mapping between the words f1...fJ in the source sentence and the words e1...eI in the target sentence. The search for the optimal alignment between the source sentence f and the target sentence e can be stated as:</Paragraph>
    <Paragraph position="2"> where A denotes the set of all possible alignments between f and e.</Paragraph>
    <Paragraph position="3"> The transformation made in Eq. (1) allows us to address the alignment problem by using the statistical approach to machine translation described as follows. In this approach, a source language string f = fJ1 = f1 ...fJ is to be translated into a target language string e = eI1 = e1 ...eI.</Paragraph>
    <Paragraph position="4"> Every target string is regarded as a possible translation of the source language string, and we seek the one with maximum a-posteriori probability Pr(e|f). According to Bayes' decision rule, we have to choose the target string that maximizes the product of both the target language model Pr(e) and the string translation model Pr(f|e). Alignment models to structure the translation model were introduced in (Brown et al., 1993). These alignment models are similar in concept to the Hidden Markov models (HMM) used in speech recognition. The alignment mapping is j -> i = aj, from source position j to target position i = aj. In statistical alignment models Pr(f,a|e), the alignment a is usually introduced as a hidden variable. Nevertheless, in the problem described in this article, the source and the target sentences are given, and we focus on the optimization of the alignment a.</Paragraph>
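As an illustration of Bayes' decision rule above, the following Python sketch picks the target string maximizing Pr(e) x Pr(f|e) over a toy candidate set. All strings and probability values here are invented for illustration; they are not taken from the paper.

```python
import math

# Toy noisy-channel decoder: choose the target string e maximizing
# Pr(e) * Pr(f|e), computed in log space for numerical stability.
# The candidates and probabilities below are illustrative assumptions.
def best_translation(f, candidates, lm, tm):
    """lm maps e to Pr(e); tm maps (f, e) to Pr(f|e)."""
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))

lm = {"a room": 0.02, "one chamber": 0.001}
tm = {("una habitacion", "a room"): 0.3,
      ("una habitacion", "one chamber"): 0.2}
print(best_translation("una habitacion", ["a room", "one chamber"], lm, tm))
```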
    <Paragraph position="5"> The translation probability Pr(f,a|e) can be rewritten as follows:</Paragraph>
    <Paragraph position="7"> The probability Pr(f,a|e) can be estimated by using the word-based IBM statistical alignment models (Brown et al., 1993). These models, however, constrain the set of possible alignments so that each word in the source sentence can be aligned to at most one word in the target sentence. Of course, &quot;real&quot; alignments, in most cases, do not follow this limitation. Hence, the alignments obtained from the IBM models have to be extended in some way to achieve more realistic alignments. This is usually done by computing the alignments in both directions (i.e., first from f to e and then from e to f) and then combining them in a suitable way (a process known as symmetrization).</Paragraph>
  </Section>
  <Section position="4" start_page="47" end_page="48" type="metho">
    <SectionTitle>
3 Estimation of Distribution Algorithms
</SectionTitle>
    <Paragraph position="0"> Estimation of Distribution Algorithms (EDAs) (Larra~naga and Lozano, 2001) are metaheuristics that have gained interest during the last five years due to their high performance when solving combinatorial optimization problems. EDAs, like genetic algorithms (Michalewicz, 1996), are population-based evolutionary algorithms but, instead of using genetic operators, they are based on the estimation/learning and posterior sampling of a probability distribution that relates the variables or genes forming an individual or chromosome. In this way, the dependence/independence relations between these variables can be explicitly modelled in the EDA framework. The operation mode of a canonical EDA is shown in Figure 1.</Paragraph>
    <Paragraph position="1"> As we can see, the algorithm maintains a population of m individuals during the search. An individual is a candidate or potential solution to the problem being optimized; e.g., in the problem considered here, an individual would be a possible alignment. Usually, in combinatorial optimization problems, an individual is represented as a vector of integers a = &lt;a1,...,aJ&gt;, where each position aj can take values from a finite set Omega_aj = {0,...,I}. The first step in an evolutionary algorithm is to generate the initial population D0. Although D0 is usually generated randomly (to ensure diversity), prior knowledge can be of utility in this step.</Paragraph>
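The representation and random initialization just described can be sketched as follows; the function name and the parameter values are ours, chosen only for illustration.

```python
import random

# Sketch of the representation described above: an individual is a
# vector of J integers, one per source word, each in {0,...,I}
# (0 encodes the NULL alignment). The initial population D0 is drawn
# uniformly at random to ensure diversity.
def random_population(m, J, I, seed=0):
    rng = random.Random(seed)
    return [[rng.randint(0, I) for _ in range(J)] for _ in range(m)]

D0 = random_population(m=50, J=6, I=5)
```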
    <Paragraph position="2"> Once we have a population, the next step is to evaluate it; that is, we have to measure the goodness or fitness of each individual with respect to the problem we are solving. Thus, we use a fitness function f(a) = Pr(f,a|e) (see Eq. (3)) to score individuals. Evolutionary algorithms in general, and EDAs in particular, seek to improve the quality of the individuals in the population during the search. In genetic algorithms the main idea is to build a new population from the current one by copying some individuals and constructing new ones from those contained in the current population. Of course, as we aim to improve the quality of the population with respect to fitness, the best/fittest individuals have a greater chance of being copied or selected for recombination.</Paragraph>
    <Paragraph position="3"> In EDAs, the transition between populations is quite different. The basic idea is to summarize the properties of the individuals in the population by learning a probability distribution that describes them as closely as possible. Since the quality of the population should improve at each step, only the s fittest individuals are selected to be included in the dataset used to learn the probability distribution Pr(a1,...,aJ); in this way, we try to discover the common regularities among good individuals. The next step is to obtain a set of new individuals by sampling the learnt distribution. These individuals are scored by using the fitness function and added to the ones forming the current population. Finally, the new population is formed by selecting m individuals from the 2m contained in the current one. A common practice is to use some kind of fitness-based elitism during this selection, in order to guarantee that the best individual(s) are retained.</Paragraph>
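A minimal sketch of this population transition, with the model-dependent learning and sampling steps left as placeholders passed in by the caller; the function names are ours, and the elitist selection simply keeps the fittest individuals of the merged pool.

```python
# One transition of the canonical EDA described above.
# learn_model and sample_model stand in for the model-dependent steps;
# fitness scores an individual (higher is better).
def eda_step(population, fitness, s, learn_model, sample_model, n_new):
    ranked = sorted(population, key=fitness, reverse=True)
    model = learn_model(ranked[:s])            # learn from the s fittest
    offspring = [sample_model(model) for _ in range(n_new)]
    merged = ranked + offspring                # old and new individuals
    # elitist selection: keep the best, so top fitness never decreases
    return sorted(merged, key=fitness, reverse=True)[:len(population)]
```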
    <Paragraph position="4"> The main problem in the previous description is the estimation/learning of the probability distribution, since estimating the joint distribution is intractable in most cases. In practice, what is learnt is a probabilistic model that consists of a factorization of the joint distribution. Different levels of complexity can be considered in that factorization, from univariate distributions to n-variate ones or Bayesian networks (see (Larra~naga and Lozano, 2001, Chapter 3) for a review). In this paper, as this is the first approach to the alignment problem with EDAs and because of some questions that will be discussed later, we use the simplest EDA model: the Univariate Marginal Distribution Algorithm or UMDA (Muhlenbein, 1997). In UMDA it is assumed that all the variables are marginally independent; thus, the J-dimensional probability distribution Pr(a1,...,aJ) is factorized as the product of J marginal/unidimensional distributions: prod_{j=1}^{J} Pr(aj). Among the advantages of UMDA we can cite the following: no structural learning is needed; parameter learning is fast; small datasets can be used because only marginal probabilities have to be estimated; and the sampling process is easy because each variable is sampled independently.</Paragraph>
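A possible UMDA implementation of the learning and sampling steps could look like this; the Laplace smoothing is our addition (not stated in the paper) to keep unseen values sampleable.

```python
import random
from collections import Counter

# UMDA sketch: learn J independent marginals Pr(aj) from the selected
# individuals, then sample each position independently.
def umda_learn(selected, I):
    J = len(selected[0])
    counts = [Counter(ind[j] for ind in selected) for j in range(J)]
    # add-one (Laplace) smoothing over the I + 1 possible values
    return [[(counts[j][v] + 1) / (len(selected) + I + 1)
             for v in range(I + 1)] for j in range(J)]

def umda_sample(marginals, rng=random):
    return [rng.choices(range(len(p)), weights=p)[0] for p in marginals]
```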
  </Section>
  <Section position="5" start_page="48" end_page="50" type="metho">
    <SectionTitle>
4 Design of an EDA to search for alignments
</SectionTitle>
    <Paragraph position="0"> In this section, an EDA algorithm to align a source sentence and a target sentence is described.</Paragraph>
    <Section position="1" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
4.1 Representation
</SectionTitle>
      <Paragraph position="0"> One of the most important issues in the definition of a search algorithm is to properly represent the space of solutions to the problem. In the problem considered here, we are searching for an &quot;optimal&quot; alignment between a source sentence f and a target sentence e. Therefore, the space of solutions can be stated as the set of possible alignments between both sentences. Owing to the constraints imposed by the IBM models (a word in f can be aligned to at most one word in e), the most natural way to represent a solution to this problem consists of storing each possible alignment in a vector a = a1...aJ, where J is the length of f. Each position of this vector can take the value &quot;0&quot;, representing a NULL alignment (that is, a word in the source sentence that is aligned to no word in the target sentence), or an index representing any position in the target sentence. An example of alignment is shown in Figure 4.1.</Paragraph>
      <Paragraph position="1"> [Figure: alignment example between the English sentence &quot;Please, I would like to book a room&quot; (plus the NULL word) and the Spanish sentence &quot;desearia reservar una habitacion&quot;.]</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="2" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
4.2 Evaluation function
</SectionTitle>
      <Paragraph position="0"> During the search process, each individual (search hypothesis) is scored using the fitness function described as follows. Let a = a1...aJ be the alignment represented by an individual. This alignment a is evaluated by computing the probability p(f,a|e).</Paragraph>
      <Paragraph position="1"> This probability is computed by using the IBM model 4 as:</Paragraph>
      <Paragraph position="3"> where the factors separated by multiplication symbols denote fertility, translation, head permutation, non-head permutation, null-fertility, and null-translation probabilities (see footnote 1). This model was trained using the GIZA++ toolkit (Och and Ney, 2003) on the material available for the different alignment tasks described in section 5.1.</Paragraph>
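Since the full IBM Model 4 score involves all the fertility and permutation factors listed above, the sketch below uses a simplified, IBM Model 1-style stand-in for the fitness p(f,a|e); it is purely illustrative and is not the formula the paper uses, and the lexical probabilities t(.|.) are invented.

```python
import math

# Simplified stand-in fitness: log p(f,a|e) = sum_j log t(fj | e_{aj}).
# Fertility, permutation, and NULL-fertility factors of Model 4 are
# deliberately omitted; t() values are hypothetical.
def fitness(f_words, a, e_words, t, null="NULL"):
    score = 0.0
    for j, fj in enumerate(f_words):
        ej = e_words[a[j] - 1] if a[j] > 0 else null  # a[j] = 0 is NULL
        score += math.log(t.get((fj, ej), 1e-10))     # floor unseen pairs
    return score  # log-probability; higher is better
```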
    </Section>
    <Section position="3" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
4.3 Search
</SectionTitle>
      <Paragraph position="0"> In this section, some specific details about the search are given. As was mentioned in section 3, the algorithm starts by generating an initial set of hypotheses (initial population); in this case, a set of randomly generated alignments between the source and the target sentences. Afterwards, all the individuals in this population (a fragment of a real population is shown in figure 3) are scored using the fitness function defined in section 4.2. At this point, the actual search starts by applying the scheme shown in section 3, thereby leading to a gradual improvement of the hypotheses handled by the algorithm in each step of the search.</Paragraph>
      <Paragraph position="1"> This process finishes when some termination criterion (or criteria) is met. In our implementation, the algorithm finishes when a certain number of generations passes without improving the quality of the hypotheses (individuals). Then, the best individual in the current population is returned as the final solution.</Paragraph>
      <Paragraph position="2"> Regarding the EDA model, as mentioned before, our approach relies on the UMDA model, mainly due to the size of the search space defined by the task.</Paragraph>
      <Paragraph position="3"> The algorithm has to deal with individuals of length J, where each position can take (I + 1) possible values. Thus, in the case of UMDA, the number of free parameters to be learnt for each position is I (e.g., in the English-French task avg(J) = 15 and avg(I) = 17.3). If more complex models were considered, the size of the probability tables would grow exponentially. For example, in a bivariate model each variable (position) is conditioned on another variable, and thus the probability tables P(.|.) to be learnt have I(I + 1) free parameters. In order to properly estimate these probability distributions, the size of the populations would have to be increased considerably. As a result, the computational resources required by the algorithm would rise dramatically. Footnote 1: the symbols in this formula are: J (the length of f), I (the length of e), ei (the i-th word in eI1), e0 (the NULL word), phi_i (the fertility of ei), tau_ik (the k-th word produced by ei in a), pi_ik (the position of tau_ik in f), rho_i (the position of the first fertile word to the left of ei in a), and c_rho_i (the ceiling of the average of all pi_{rho_i,k} for rho_i, or 0 if rho_i is undefined).</Paragraph>
      <Paragraph position="5"> Figure 3: a fragment of a population during the search for the alignments between the English sentence &quot;and then he tells us the correct result !&quot; and the Romanian sentence &quot;si ne spune noua rezultatul corect !&quot;. These sentences are part of the HLT-NAACL 2005 shared task. Some individuals and their scores (fitness) are shown.</Paragraph>
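The parameter counts discussed above can be checked with a quick calculation, rounding avg(I) to 17 for concreteness (the totals are ours; the paper only gives the per-position counts).

```python
# Per the discussion above: UMDA needs I free parameters per position;
# a bivariate model needs I(I + 1) per position. Totals over J positions,
# using the English-French averages avg(J) = 15, avg(I) ~ 17.
J, I = 15, 17
umda_params = J * I                  # J univariate marginals
bivariate_params = J * I * (I + 1)   # each position conditioned on another
print(umda_params, bivariate_params)
```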
      <Paragraph position="7"> Finally, as was described in section 3, some parameters have to be fixed in the design of an EDA.</Paragraph>
      <Paragraph position="8"> On the one hand, the size of each population must be defined. In this case, this size is proportional to the length of the sentences to be aligned. Specifically, the size of the population adopted is equal to ten times the length of the source sentence f.</Paragraph>
      <Paragraph position="9"> On the other hand, as we mentioned in section 3, the probability distribution over the individuals is not estimated from the whole population. In the present task, about 20% of the best individuals in each population are used for this purpose.</Paragraph>
      <Paragraph position="10"> As mentioned above, the fitness function used in the algorithm only allows for unidirectional alignments. Therefore, the search was conducted in both directions (i.e., from f to e and from e to f), combining the final results to achieve bidirectional alignments. To this end, different approaches (symmetrization methods) were tested. The results shown in section 5.2 were obtained by applying the refined method proposed in (Och and Ney, 2000).</Paragraph>
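A sketch of symmetrization from two unidirectional alignment vectors follows; intersection and union are shown, while the refined method of Och and Ney (2000) additionally grows the intersection with neighbouring links from the union, a step not reproduced here. Function names are ours.

```python
# Combine f-to-e and e-to-f alignment vectors into bidirectional links.
def to_links(a):
    # alignment vector -> set of (source, target) links; NULLs (0) dropped
    return {(j + 1, aj) for j, aj in enumerate(a) if aj > 0}

def symmetrize(a_f2e, a_e2f):
    f2e = to_links(a_f2e)
    e2f = {(j, i) for (i, j) in to_links(a_e2f)}  # flip to (source, target)
    return f2e.intersection(e2f), f2e.union(e2f)
```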
    </Section>
  </Section>
  <Section position="6" start_page="50" end_page="51" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Different experiments have been carried out in order to assess the correctness of the search algorithm.</Paragraph>
    <Paragraph position="1"> Next, the experimental methodology employed and the results obtained are described.</Paragraph>
    <Section position="1" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
5.1 Corpora and evaluation
</SectionTitle>
      <Paragraph position="0"> Three different corpora and four different test sets have been used. All of them are taken from the two shared tasks in word alignment developed in HLT/NAACL 2003 (Mihalcea and Pedersen, 2003) and ACL 2005 (Joel Martin, 2005). These two tasks involved four different pairs of languages: English-French, Romanian-English, English-Inuktitut, and English-Hindi. The English-French and Romanian-English pairs have been considered in these experiments (owing to the lack of time to properly pre-process the Hindi and the Inuktitut data). Next, a brief description of the corpora used is given.</Paragraph>
      <Paragraph position="1"> Regarding the Romanian-English task, the test data used to evaluate the alignments consisted of 248 sentences for the 2003 evaluation task and 200 for the 2005 evaluation task. In addition to this, a training corpus, consisting of about 1 million Romanian words and about the same number of English words, has been used. The IBM word-based alignment models were trained on the whole corpus (training + test). On the other hand, a subset of the Canadian Hansards corpus has been used in the English-French task. The test corpus consists of 447 English-French sentences. The training corpus contains about 20 million English words, and about the same number of French words. In Table 1, the features of the different corpora used are shown.</Paragraph>
      <Paragraph position="2"> To evaluate the quality of the final alignments obtained, different measures have been taken into account: Precision, Recall, F-measure, and Alignment Error Rate. Given an alignment A and a reference alignment G (both A and G can be split into two subsets AS,AP and GS, GP , respectively representing Sure and Probable alignments) Precision (PT ), Recall (RT ), F-measure (FT ) and Alignment Error Rate (AER) are computed as (where T is the alignment type, and can be set to either S or P):</Paragraph>
      <Paragraph position="4"> It is important to emphasize that EDAs are non-deterministic algorithms. Because of this, the results presented in section 5.2 are actually the mean of the results obtained in ten different executions of the search algorithm.</Paragraph>
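Since the formulas themselves are elided in this version of the text, the measures named above are written out below from the standard shared-task definitions: A is the hypothesis link set, S the sure and P the probable reference links (with S a subset of P).

```python
# Standard word-alignment quality measures (Mihalcea and Pedersen, 2003).
def precision(A, G):
    return len(A.intersection(G)) / len(A)

def recall(A, G):
    return len(A.intersection(G)) / len(G)

def f_measure(A, G):
    p, r = precision(A, G), recall(A, G)
    return 2 * p * r / (p + r)

def aer(A, S, P):
    # Alignment Error Rate: 1 - (|A n S| + |A n P|) / (|A| + |S|)
    return 1 - (len(A.intersection(S)) + len(A.intersection(P))) / (len(A) + len(S))
```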
    </Section>
  </Section>
</Paper>