<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0835">
  <Title>A Recursive Statistical Translation Model</Title>
  <Section position="4" start_page="199" end_page="200" type="metho">
    <SectionTitle>
3 Some notation
</SectionTitle>
    <Paragraph position="0"> In the rest of the paper, we use the following notation. Sentences are taken as concatenations of symbols (words) and are represented by a letter with a small bar, as in $\bar{x}$. The individual words are denoted by the name of the sentence and a subscript indicating the position, so $\bar{x} = x_1 x_2 \ldots x_n$. The length of a sentence is indicated by $|\bar{x}|$. Segments of a sentence are denoted by $\bar{x}_i^j = x_i \ldots x_j$. For substrings of the form $\bar{x}_i^{|\bar{x}|}$ we use the notation $\bar{x}_i^{\cdot}$.</Paragraph>
    <Paragraph position="1"> Consistently, $\bar{x}$ denotes the input sentence and $\bar{y}$ its translation; both are assumed to have at least one word. The input and output vocabularies are $X$ and $Y$, respectively. Finally, we assume that we are presented with a set $M$ for training our models. The elements of this set are pairs $(\bar{x}, \bar{y})$ where $\bar{y}$ is a possible translation of $\bar{x}$.</Paragraph>
    <Paragraph position="2"> 4 IBM's model 1
IBM's model 1 is the simplest of a hierarchy of five statistical models introduced in (Brown et al., 1993). Each model of the hierarchy can be seen as a refinement of the previous ones. Although model 1, which we study here, relies on the concept of alignment, its formulation allows it to be interpreted as a relationship between multisets of words (the order of the words is irrelevant in the final formula).</Paragraph>
    <Paragraph position="3"> A word of warning is in order here. The model we are going to present differs in an important way from the original: we do not use the empty word. This is a virtual word which does not belong to the vocabulary of the task and which is added to the beginning of each sentence in order to allow words in the output that cannot be justified by the words in the input. We have decided not to incorporate it because of the use we are going to make of the model. As we will see, model 1 is going to be used repeatedly over different substrings of the input sentence in order to analyze their contribution to the total translation. This means that we would have an empty word in each of these substrings. We have decided to avoid this "proliferation" of empty words. Future work may introduce the concept in a more appropriate way.</Paragraph>
    <Paragraph position="4"> Model 1 makes two assumptions: that a stochastic dictionary can be employed to model the probability that word $y$ is the translation of word $x$, and that all the words in the input sentence have the same weight in producing a word in the output. This leads to:</Paragraph>
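The displayed formula (Equation 1) is missing from this extraction; a plausible reconstruction under the two assumptions above, including the length table $e$ described next, is the standard model 1 form without the empty word:

    \[ p_I(\bar{y} \mid \bar{x}) \;=\; e(|\bar{y}| \mid |\bar{x}|) \prod_{j=1}^{|\bar{y}|} \frac{1}{|\bar{x}|} \sum_{i=1}^{|\bar{x}|} t(y_j \mid x_i) \qquad (1) \]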
    <Paragraph position="6"> where $t$ is the stochastic dictionary and $e$ represents a table that relates the length of the alignment to the length of the input sentence (we assume that there is a finite range of possible lengths). This explicit relation between the lengths is not present in the original formulation of the model, but we prefer to include it so that the probabilities are adequately normalized.</Paragraph>
    <Paragraph position="7"> Clearly, this model is not adequate for describing complex translations in which complicated patterns and word order changes may appear. Nevertheless, it can do a good job of describing the translation of short segments of text. For example, it can be adequate to model the translation of the Spanish "gracias" into the English "thank you".</Paragraph>
  </Section>
  <Section position="5" start_page="200" end_page="201" type="metho">
    <SectionTitle>
5 A Recursive Alignment Model
</SectionTitle>
    <Paragraph position="0"> To overcome that limitation of the model we will take the following approach: if the sentence is complex enough, it will be divided in two and the two halves will be translated independently and joined later; if the sentence is simple, model 1 will be used.</Paragraph>
    <Paragraph position="1"> Let us formalize this intuition for the generative model. We are given an input sentence $\bar{x}$ and the first decision is whether $\bar{x}$ is going to be translated by IBM's model 1 or whether it is complex enough to be translated by MAR. In the second case, three steps are taken: a cut point of $\bar{x}$ is defined, each of the resulting parts is translated, and the corresponding translations are concatenated. For the translations of the second step, the same process is applied recursively.</Paragraph>
    <Paragraph position="2"> The concatenation of the third step can be done in a "direct" way (the translation of the first part and then the translation of the second) or in an "inverse" way (the translation of the second part and then the translation of the first). The aim of this choice is to allow for the differences in word order between the input and output languages.</Paragraph>
    <Paragraph position="3"> So, we are proposing an alignment model in which IBM's model 1 will account for the translation of elementary segments or individual words, while the translation of larger and more complex segments or whole sentences will rely on a hierarchical alignment pattern in which model 1 alignments will lie at the lowest level of the hierarchy.</Paragraph>
    <Paragraph position="4"> Following this discussion, the model can be formally described through a series of four random experiments: * The first is the selection of the model. It has two possible outcomes: IBM and MAR, with obvious meanings.</Paragraph>
    <Paragraph position="5"> * The second is the choice of $b$, a cut point of $\bar{x}$.</Paragraph>
    <Paragraph position="6"> The segment $\bar{x}_1^b$ will be used to generate one of the parts of the translation, the segment $\bar{x}_{b+1}^{\cdot}$ will generate the other. It takes values from 1 to $|\bar{x}|-1$.</Paragraph>
    <Paragraph position="7"> * The third is the decision about the order of the concatenation. It has two possible outcomes: D (for direct) and I (for inverse).</Paragraph>
    <Paragraph position="8"> * The fourth is the translation of each of the halves of $\bar{x}$. They take values in $Y^+$.</Paragraph>
    <Paragraph position="9"> The translation probability can be approximated as follows:</Paragraph>
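The displayed approximation is missing here; given the model-selection experiment above, it presumably has the mixture form (a reconstruction, not the original display):

    \[ p_T(\bar{y} \mid \bar{x}) \;\approx\; p(\mathrm{IBM} \mid \bar{x})\, p_I(\bar{y} \mid \bar{x}) \;+\; p(\mathrm{MAR} \mid \bar{x})\, p_M(\bar{y} \mid \bar{x}) \]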
    <Paragraph position="11"> The value of $p_I(\bar{y} \mid \bar{x})$ corresponds to IBM's model 1 (Equation 1). To derive $p_M(\bar{y} \mid \bar{x})$, we observe that:</Paragraph>
    <Paragraph position="13"> Note that the probability that $\bar{y}$ is generated from a pair $(\bar{y}_1, \bar{y}_2)$ is 0 if $\bar{y} \neq \bar{y}_1\bar{y}_2$ and 1 if $\bar{y} = \bar{y}_1\bar{y}_2$, so the last two lines can be rewritten as:</Paragraph>
    <Paragraph position="15"> where $\mathrm{pref}(\bar{y})$ is the set of prefixes of $\bar{y}$. And finally:</Paragraph>
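The display labeled (2) is missing from this extraction; a plausible reconstruction consistent with the derivation above (summing over cut points, concatenation orders and output prefixes, with $\bar{y}_2$ denoting the suffix that completes $\bar{y}_1$ to $\bar{y}$) is:

    \[ p_M(\bar{y} \mid \bar{x}) = \sum_{b=1}^{|\bar{x}|-1} p(b \mid \bar{x}) \sum_{\bar{y}_1 \in \mathrm{pref}(\bar{y})} \Big[ p(D \mid \bar{x}, b)\, p_T(\bar{y}_1 \mid \bar{x}_1^b)\, p_T(\bar{y}_2 \mid \bar{x}_{b+1}^{\cdot}) + p(I \mid \bar{x}, b)\, p_T(\bar{y}_1 \mid \bar{x}_{b+1}^{\cdot})\, p_T(\bar{y}_2 \mid \bar{x}_1^b) \Big] \qquad (2) \]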
    <Paragraph position="17"> The number of parameters of this model is very large, so it is necessary to introduce some simplifications. The first one relates to the choice of the translation model: we assume that it can be made just on the basis of the length of the input sentence. That is, we can set up two tables, $M_I$ and $M_M$, so that</Paragraph>
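The defining display is missing; from the description it presumably reads (reconstruction):

    \[ p(\mathrm{IBM} \mid \bar{x}) = M_I(|\bar{x}|), \qquad p(\mathrm{MAR} \mid \bar{x}) = M_M(|\bar{x}|) \]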
    <Paragraph position="19"> Obviously, for any $\bar{x} \in X^+$ we will have $M_I(|\bar{x}|) + M_M(|\bar{x}|) = 1$. On the other hand, since it is not possible to break a one-word sentence, we define $M_I(1) = 1$. This restriction is in line with the idea mentioned before: the translation of longer sentences will be structured, whereas shorter ones can be translated directly.</Paragraph>
    <Paragraph position="20"> In order to decide the cut point, we will assume that the probability of cutting the input sentence at a given position $b$ is most influenced by the words around it: $x_b$ and $x_{b+1}$. We use a table $B$ such that:</Paragraph>
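The display is missing; the normalization described in the next paragraph suggests (reconstruction):

    \[ p(b \mid \bar{x}) = \frac{B(x_b, x_{b+1})}{\sum_{b'=1}^{|\bar{x}|-1} B(x_{b'}, x_{b'+1})} \]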
    <Paragraph position="22"> This can be interpreted as having a weight for each pair of words and normalizing these weights in each sentence in order to obtain a proper probability distribution. Two more tables, $D_D$ and $D_I$, are used to store the probabilities that the alignment be direct or inverse. As before, we assume that the decision can be made on the basis of the symbols around the cut point:</Paragraph>
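Again the display is missing; the natural form given the text is (reconstruction):

    \[ p(D \mid \bar{x}, b) = D_D(x_b, x_{b+1}), \qquad p(I \mid \bar{x}, b) = D_I(x_b, x_{b+1}) \]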
    <Paragraph position="24"> Again, we have $D_D(x_b, x_{b+1}) + D_I(x_b, x_{b+1}) = 1$ for every pair of words $(x_b, x_{b+1})$.</Paragraph>
    <Paragraph position="25"> Finally, a probability must be assigned to the translation of the two halves. Assuming that they are independent, we can apply the model in a recursive manner:</Paragraph>
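The display is missing; under the independence assumption it is presumably, for the direct order and analogously for the inverse one (reconstruction):

    \[ p(\bar{y}_1, \bar{y}_2 \mid \bar{x}, b, D) = p_T(\bar{y}_1 \mid \bar{x}_1^b)\, p_T(\bar{y}_2 \mid \bar{x}_{b+1}^{\cdot}) \]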
    <Paragraph position="27"> Thus, we can rewrite (2) as:</Paragraph>
    <Paragraph position="29"/>
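The final rewritten equation is missing from this extraction. To make the recursive structure concrete, here is a minimal sketch in Python of how $p_T$ could be evaluated under the simplifications above; the function and table names (model1_prob, MI, B, DD) are assumptions for illustration, not the authors' code.

from functools import lru_cache

# Illustrative sketch of the recursive translation probability p_T.
# model1_prob(x, y) implements Equation (1); MI, B and DD are the parameter
# tables described in the text.  All names here are assumptions for this sketch.

def p_T(x, y, MI, B, DD, model1_prob):
    """Translation probability of output y given input x (sequences of words)."""

    @lru_cache(maxsize=None)
    def rec(xs, ys):
        # MI(1) = 1: a one-word input is always handled by model 1.
        mi = 1.0 if len(xs) == 1 else MI.get(len(xs), 0.5)
        return mi * model1_prob(xs, ys) + (1.0 - mi) * p_M(xs, ys)

    def p_M(xs, ys):
        # MAR part: sum over cut points, concatenation orders and output splits.
        if len(xs) < 2 or len(ys) < 2:
            return 0.0
        # Cut-point weights, normalized within this input segment.
        weights = [B.get((xs[b - 1], xs[b]), 1e-9) for b in range(1, len(xs))]
        total = sum(weights)
        prob = 0.0
        for b in range(1, len(xs)):              # cut x after position b
            x1, x2 = xs[:b], xs[b:]
            p_b = weights[b - 1] / total
            d = DD.get((xs[b - 1], xs[b]), 0.5)  # direct; inverse is 1 - d
            for c in range(1, len(ys)):          # split y after position c
                y1, y2 = ys[:c], ys[c:]
                prob += p_b * (d * rec(x1, y1) * rec(x2, y2)
                               + (1.0 - d) * rec(x1, y2) * rec(x2, y1))
        return prob

    return rec(tuple(x), tuple(y))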
  </Section>
  <Section position="6" start_page="201" end_page="202" type="metho">
    <SectionTitle>
6 Parameter estimation
</SectionTitle>
    <Paragraph position="0"> Once the model is defined, it is necessary to find a way of estimating its parameters given a training corpus $M$. We will use maximum likelihood estimation. In our case, the likelihood of the sample corpus is:</Paragraph>
    <Paragraph position="2"> In order to maximize $V$, initial values are given to the parameters and they are reestimated by repeatedly applying Baum-Eagon's (Baum and Eagon, 1967) and Gopalakrishnan's (Gopalakrishnan et al., 1991) inequalities. Let $P$ be a parameter of the model (except for those in $B$) and let $F(P)$ be its "family", i.e. the set of parameters such that $\sum_{Q \in F(P)} Q = 1$. Then, a new value of $P$ can be computed as follows:</Paragraph>
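The display labeled (4) is missing; a standard form of the Baum-Eagon update, consistent with the surrounding text, is (reconstruction):

    \[ \hat{P} = \frac{P\,\dfrac{\partial V}{\partial P}}{\displaystyle\sum_{Q \in F(P)} Q\,\dfrac{\partial V}{\partial Q}} \qquad (4) \]

where the quantities $P\,\partial V / \partial P$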
    <Paragraph position="4"> are the "counts" of parameter $P$. This is correct as long as $V$ is a polynomial in $P$. However, we have a problem for $B$, since $V$ is a rational function of these parameters. We can solve it by assuming, without loss of generality, that $\sum_{x_1, x_2 \in X} B(x_1, x_2) = 1$. Then Gopalakrishnan's inequality can be applied similarly and we get:</Paragraph>
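The display labeled (6) is also missing; a plausible Gopalakrishnan-style update for $B$ is (reconstruction):

    \[ \hat{B}(x_1, x_2) = \frac{B(x_1, x_2)\left(\dfrac{\partial V}{\partial B(x_1, x_2)} + C\right)}{\displaystyle\sum_{x'_1, x'_2 \in X} B(x'_1, x'_2)\left(\dfrac{\partial V}{\partial B(x'_1, x'_2)} + C\right)} \qquad (6) \]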
    <Paragraph position="6"> where $C$ is an adequate constant. Now it is easy to design a reestimation algorithm. The algorithm gives arbitrary initial values to the parameters (typically those corresponding to uniform probabilities), computes the counts of the parameters for the corpus and, using either (4) or (6), obtains new values for the parameters. This cycle is repeated until a stopping criterion (in our case a prefixed number of iterations) is met. This algorithm can be seen in Figure 1.</Paragraph>
  </Section>
  <Section position="7" start_page="202" end_page="203" type="metho">
    <SectionTitle>
7 Some notes on efficiency
</SectionTitle>
    <Paragraph position="0"> Estimating the parameters as discussed above entails high computational costs: computing $p_T(\bar{y} \mid \bar{x})$ requires $O(mn)$ arithmetic operations involving the values of $p_T(\bar{y}_i^j \mid \bar{x}_k^l)$ for every possible value of $i$, $j$, $k$ and $l$, of which there are $O(m^2 n^2)$. This results in a global cost of $O(m^3 n^3)$. On the other hand, computing $\partial p_T / \partial P$ costs as much as computing $p_T$. So it is interesting to keep the number of computed derivatives low.</Paragraph>
    <Section position="1" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
7.1 Reduction of the parameters to train
</SectionTitle>
      <Paragraph position="0"> In the experiments we have followed some heuristics in order not to reestimate certain parameters: * The values of $M_I$ (and, consequently, of $M_M$) for lengths higher than a threshold are assumed to be 0 and therefore there is no need to estimate them.</Paragraph>
      <Paragraph position="1"> * As a consequence, the values of $e$ for lengths above the same threshold need not be reestimated. * The values of $t$ for pairs of words with counts under a certain threshold are not reestimated. Furthermore, during the computation of counts, the recursion is cut on those substring pairs where the value of the probability of the translation is very small.</Paragraph>
    </Section>
    <Section position="2" start_page="202" end_page="203" type="sub_section">
      <SectionTitle>
7.2 Efficient computation of model 1
</SectionTitle>
      <Paragraph position="0"> Another source of optimization is the realization that, for computing $p_T(\bar{y} \mid \bar{x})$, it is necessary to compute the value of $p_I$ for each possible pair $(\bar{x}_{ib}^{ie}, \bar{y}_{ob}^{oe})$ (where $ib$, $ie$, $ob$ and $oe$ stand for input begin, input end, output begin and output end, respectively).</Paragraph>
      <Paragraph position="1"> Fortunately, it is possible to accelerate these computations. First, define:</Paragraph>
      <Paragraph position="3"> [Figure 1: the maximum likelihood estimation algorithm. Give initial values to the parameters; repeat: initialize the counts to 0; accumulate the counts over the corpus; for each parameter P, compute its new value from its counts; until the stopping criterion is met.]</Paragraph>
      <Paragraph position="5"> This leads to</Paragraph>
      <Paragraph position="7"> if $ob \neq oe$.</Paragraph>
      <Paragraph position="8"> So we can compute all values of I with the algorithm in Figure 2.</Paragraph>
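The definition of $I$ and the recurrence are missing from this extraction, but the usual trick is to extend a segment pair by one output word, reusing the value already computed for the shorter pair. A minimal sketch in Python, assuming $I$ accumulates the product of averaged dictionary sums over the output span (Equation 1 without the length table $e$); the names and data layout are illustrative assumptions:

# Sketch of an incremental computation of model 1 scores for all sub-segment
# pairs of an input sentence x and an output sentence y.  t is the stochastic
# dictionary, stored as a dict mapping (output_word, input_word) to t(y|x).

def all_model1_scores(x, y, t):
    n, m = len(x), len(y)
    # S[j][ib][ie] = sum of t(y[j] | x[i]) for i in [ib, ie], built incrementally
    # over ie so that each value costs a single addition.
    S = [[[0.0] * n for _ in range(n)] for _ in range(m)]
    for j in range(m):
        for ib in range(n):
            acc = 0.0
            for ie in range(ib, n):
                acc += t.get((y[j], x[ie]), 1e-12)
                S[j][ib][ie] = acc
    # I[(ib, ie, ob, oe)] extends I[(ib, ie, ob, oe - 1)] by one output word,
    # so every entry again costs a constant number of operations.
    I = {}
    for ib in range(n):
        for ie in range(ib, n):
            length = ie - ib + 1
            for ob in range(m):
                acc = 1.0
                for oe in range(ob, m):
                    acc *= S[oe][ib][ie] / length
                    I[(ib, ie, ob, oe)] = acc
    return I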
    </Section>
    <Section position="3" start_page="203" end_page="203" type="sub_section">
      <SectionTitle>
7.3 Splitting the corpora
</SectionTitle>
      <Paragraph position="0"> Another way of reducing the costs of training has been the use of a heuristic to split long sentences into smaller parts with a length less than $l$ words.</Paragraph>
      <Paragraph position="1"> Suppose we are to split sentences $\bar{x}$ and $\bar{y}$. We begin by aligning each word in $\bar{y}$ to a word in $\bar{x}$.</Paragraph>
      <Paragraph position="2"> Then, a score and a translation are assigned to each substring $\bar{x}_i^j$ with a length below $l$. The translation is produced by looking for the substring of $\bar{y}$ which has a length below $l$ and which has the largest number of words aligned to positions between $i$ and $j$. The pair so obtained is given a score equal to the sum of: (a) the square of the length of $\bar{x}_i^j$; (b) the square of the number of words in the output aligned to the input; and (c) minus ten times the sum of the square of the number of words aligned to a nonempty position outside $\bar{x}_i^j$ and the number of words outside the chosen segment that are aligned to $\bar{x}_i^j$.</Paragraph>
      <Paragraph position="3"> After the segments of $\bar{x}$ are so scored, the partition of $\bar{x}$ that maximizes the sum of scores is computed by dynamic programming.</Paragraph>
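A minimal sketch of this dynamic programming step, assuming a hypothetical score(i, j) function that implements the heuristic above for the segment x[i:j] (0-based, j exclusive) and a bound max_len playing the role of l:

# Find the segmentation of a sentence of n words that maximizes the sum of
# segment scores, with every segment shorter than max_len words.

def best_partition(n, score, max_len):
    NEG = float("-inf")
    best = [NEG] * (n + 1)   # best[j]: best total score of a partition of x[0:j]
    back = [0] * (n + 1)     # back[j]: start index of the last segment used
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len + 1), j):
            cand = best[i] + score(i, j)
            if cand > best[j]:
                best[j], back[j] = cand, i
    # Recover the segments by following the back-pointers from the end.
    segments, j = [], n
    while j > 0:
        segments.append((back[j], j))
        j = back[j]
    return list(reversed(segments))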
    </Section>
  </Section>
  <Section position="8" start_page="203" end_page="204" type="metho">
    <SectionTitle>
8 Translating the test sentences
</SectionTitle>
    <Paragraph position="0"> The MAR model can be used to obtain adequate bilingual templates, which can then be used to translate new test sentences with an appropriate template-based translation system. Here we have adopted the pharaoh program (Koehn, 2004).</Paragraph>
    <Section position="1" start_page="203" end_page="204" type="sub_section">
      <SectionTitle>
8.1 Finding the templates
</SectionTitle>
      <Paragraph position="0"> The parameters of the MAR were trained using the algorithm above: first, ten IBM model 1 iterations were used to give initial values to the dictionary probabilities; then, five more iterations re-trained the dictionary together with the rest of the parameters.</Paragraph>
      <Paragraph position="1"> The alignment of a pair has the form of a tree similar to the one in Figure 3 (this is one of the sentences from the Spanish-English part of the training corpus). Each interior node has two children corresponding to the translation of the two parts in which the input sentence is divided. The leaves of the tree correspond to those segments that were translated by model 1. The templates generated were those defined by the leaves. Further templates were obtained by interpreting each pair of words in the dictionary as a template.</Paragraph>
      <Paragraph position="3"> Each template was assigned four weights in order to use the pharaoh program. For the templates obtained from the alignments, the first weight was the probability assigned to the template by MAR; the second weight was the count for the template, i.e., the number of times that template was found in the corpus; the third weight was the normalized count, i.e., the number of times the template appeared in the corpus divided by the number of times its input part was present in the corpus; finally, the fourth weight was a small constant ($10^{-30}$). The intention of this last weight was to ease the combination with the templates from the dictionary. For these, the first three weights were assigned the same small constant and the fourth was the probability of the translation of the pair obtained from the stochastic dictionary. This weighting scheme made it possible to separate the influence of the dictionary in smoothing the templates.</Paragraph>
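As an illustration of this weighting scheme, a small sketch follows; the function names and the exact phrase-table format expected by pharaoh are assumptions, not the authors' implementation:

# Sketch of the four-weight scheme described above.  SMALL plays the role of
# the small constant (10**-30) used to keep dictionary templates as smoothing.

SMALL = 1e-30

def alignment_template_weights(mar_prob, count, input_count):
    # Templates extracted from the leaves of the MAR alignment trees.
    return (mar_prob,             # probability assigned by MAR
            count,                # raw count of the template in the corpus
            count / input_count,  # count normalized by the input-side frequency
            SMALL)                # fourth weight reserved for dictionary templates

def dictionary_template_weights(dict_prob):
    # Templates built from single word pairs of the stochastic dictionary.
    return (SMALL, SMALL, SMALL, dict_prob)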
    </Section>
  </Section>
  <Section position="9" start_page="204" end_page="206" type="metho">
    <SectionTitle>
9 Experiments
</SectionTitle>
    <Paragraph position="0"> In order to test the model, we have decided to participate in the shared task for this workshop.</Paragraph>
    <Section position="1" start_page="204" end_page="205" type="sub_section">
      <SectionTitle>
9.1 The task
</SectionTitle>
      <Paragraph position="0"> The aim of the task was to translate a set of 2,000 sentences from German, Spanish, Finnish and French into English. Those sentences were extracted from the Europarl corpus (Koehn, Unpublished). As training material, four different corpora were provided, one for each language pair, comprising around 700,000 sentence pairs each. Some details about these corpora can be seen in Table 1. An automatic alignment for each corpus was also provided. The original sentence pairs were split using the techniques discussed in section 7.3. The total number of sentences after the split is presented in Table 2. Two different alignments were used: (a) the one provided in the definition of the task and (b) one obtained using GIZA++ (Och and Ney, 2003) to train an IBM model 4. As can be seen, the number of parts is very similar in both cases. [Table 2 caption fragment: "... a maximum length of ten. 'Provided' refers to the alignment provided in the task, 'GIZA++' to those obtained with GIZA++."]</Paragraph>
      <Paragraph position="1"> [Table 3 caption fragment: "... pair: 'Alignment' shows the number of templates derived from the alignments; 'dictionary', those obtained from the dictionary; and 'total' is the sum. (a) Using the alignments provided with the task."]</Paragraph>
      <Paragraph position="2"> The number of pairs after splitting is roughly three times the original.</Paragraph>
      <Paragraph position="3"> Templates were extracted as described in section 8.1. The number of templates we obtained can be seen in Table 3. Again, the influence of the type of alignment was small. Except for Finnish, the number of dictionary templates was roughly two thirds of the templates extracted from the alignments.</Paragraph>
    </Section>
    <Section position="2" start_page="205" end_page="206" type="sub_section">
      <SectionTitle>
9.2 Obtaining the translations
</SectionTitle>
      <Paragraph position="0"> Once the templates were obtained, the development corpora were used to search for adequate values of the weights that pharaoh uses for each template (these are the weights passed to the option weight-t; the other weights were not changed, as an initial exploration seemed to indicate that they had little impact). As expected, the best weights differed between language pairs. The values can be seen in Table 4.</Paragraph>
      <Paragraph position="1"> [Table 4: Best weights for each language pair. The columns are for the probability given by the model, the counts of the templates, the normalized counts and the weight given to the dictionary. (a) Using the alignments provided with the task.]</Paragraph>
      <Paragraph position="3"> It is interesting to note that the probabilities assigned by the model to the templates seemed to be better left out of consideration. The most important feature was the counts of the templates, which were sometimes helped by the use of the dictionary, although that effect was small. Normalization of the counts also had little impact.</Paragraph>
    </Section>
  </Section>
</Paper>