<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1019">
  <Title>A Comparative Study on Reordering Constraints in Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Theoretical Discussion
</SectionTitle>
    <Paragraph position="0"> In this section, we will discuss the reordering constraints from a theoretical point of view. We will answer the question of how many word-reorderings are permitted for the ITG constraints as well as for the IBM constraints. Since we are only interested in the number of possible reorderings, the specific word identities are of no importance here. Furthermore, we assume a one-to-one correspondence between source and target words. Thus, we are interested in the number of word-reorderings, i.e. permutations, that satisfy the chosen constraints. First, we will consider the ITG constraints. Afterwards, we will describe the IBM constraints.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 ITG Constraints
</SectionTitle>
      <Paragraph position="0"> Let us now consider the ITG constraints. Here, we interpret the input sentence as a sequence of blocks.</Paragraph>
      <Paragraph position="1"> In the beginning, each position is a block of its own.</Paragraph>
      <Paragraph position="2"> Then, the permutation process can be seen as follows: we select two consecutive blocks and merge them to a single block by choosing between two options: either keep them in monotone order or invert the order. This idea is illustrated in Fig. 1. The white boxes represent the two blocks to be merged.</Paragraph>
      <Paragraph position="3"> Now, we investigate, how many permutations are obtainable with this method. A permutation derived by the above method can be represented as a binary tree where the inner nodes are colored either black or white. At black nodes the resulting sequences of the children are inverted. At white nodes they are kept in monotone order. This representation is equivalent to source positions target positions without inversion with inversion  catenation of two consecutive blocks.</Paragraph>
      <Paragraph position="4"> the parse trees of the simple grammar in (Wu, 1997). We observe that a given permutation may be constructed in several ways by the above method. For instance, let us consider the identity permutation of 1;2;:::;n. Any binary tree with n nodes and all inner nodes colored white (monotone order) is a possible representation of this permutation. To obtain a unique representation, we pose an additional constraint on the binary trees: if the right son of a node is an inner node, it has to be colored with the opposite color. With this constraint, each of these binary trees is unique and equivalent to a parse tree of the 'canonical-form' grammar in (Wu, 1997).</Paragraph>
      <Paragraph position="5"> In (Shapiro and Stephens, 1991), it is shown that the number of such binary trees with n nodes is the (n ! 1)th large Schr&amp;quot;oder number Sn!1. The (small) Schr&amp;quot;oder numbers have been first described in (Schr&amp;quot;oder, 1870) as the number of bracketings of a given sequence (Schr&amp;quot;oder's second problem). The large Schr&amp;quot;oder numbers are just twice the Schr&amp;quot;oder numbers. Schr&amp;quot;oder remarked that the ratio between two consecutive Schr&amp;quot;oder numbers approaches 3 +</Paragraph>
      <Paragraph position="7"> The Schr&amp;quot;oder numbers have many combinatorical interpretations. Here, we will mention only two of them. The first one is another way of viewing at the ITG constraints. The number of permutations of the sequence 1;2;:::;n, which avoid the subsequences (3;1;4;2) and (2;4;1;3), is the large Schr&amp;quot;oder number Sn!1. More details on forbidden subsequences can be found in (West, 1995). The interesting point is that a search with the ITG constraints cannot generate a word-reordering that contains one of these two subsequences. In (Wu, 1997), these forbidden subsequences are called 'inside-out' transpositions.</Paragraph>
      <Paragraph position="8"> Another interpretation of the Schr&amp;quot;oder numbers is given in (Knuth, 1973): The number of permutations that can be sorted with an output-restricted doubleended queue (deque) is exactly the large Schr&amp;quot;oder number. Additionally, Knuth presents an approximation for the large Schr&amp;quot;oder numbers: Sn ... cC/(3+p8)n C/n!32 (3) where c is set to 12 q (3p2!4)=.... This approximation function confirms the result of Schr&amp;quot;oder, and we obtain Sn 2 Th((3 + p8)n), i.e. the Schr&amp;quot;oder numbers grow like (3+p8)n ... 5:83n.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 IBM Constraints
</SectionTitle>
      <Paragraph position="0"> In this section, we will describe the IBM constraints (Berger et al., 1996). Here, we mark each position in the source sentence either as covered or uncovered.</Paragraph>
      <Paragraph position="1"> In the beginning, all source positions are uncovered.</Paragraph>
      <Paragraph position="2"> Now, the target sentence is produced from bottom to top. A target position must be aligned to one of the first k uncovered source positions. The IBM constraints are illustrated in Fig. 2.</Paragraph>
      <Paragraph position="3">  For most of the target positions there are k permitted source positions. Only towards the end of the sentence this is reduced to the number of remaining uncovered source positions. Let n denote the length of the input sequence and let rn denote the permitted number of permutations with the IBM constraints.</Paragraph>
      <Paragraph position="4"> Then, we obtain:</Paragraph>
      <Paragraph position="6"> Typically, k is set to 4. In this case, we obtain an asymptotic upper and lower bound of 4n, i.e. rn 2 Th(4n).</Paragraph>
      <Paragraph position="7"> In Tab. 1, the ratio of the number of permitted re-orderings for the discussed constraints is listed as a function of the sentence length. We see that for longer sentences the ITG constraints allow for more reorderings than the IBM constraints. For sentences of length 10 words, there are about twice as many reorderings for the ITG constraints than for the IBM constraints. This ratio steadily increases. For longer sentences, the ITG constraints allow for much more flexibility than the IBM constraints.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Search
</SectionTitle>
    <Paragraph position="0"> Now, let us get back to more practical aspects. Re-ordering constraints are more or less useless, if they do not allow the maximization of Eq. 2 to be performed in an efficient way. Therefore, in this section, we will describe different aspects of the search algorithm for the ITG constraints. First, we will present the dynamic programming equations and the resulting complexity. Then, we will describe pruning techniques to accelerate the search. Finally, we will extend the basic algorithm for the generation of word graphs.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Algorithm
</SectionTitle>
      <Paragraph position="0"> The ITG constraints allow for a polynomial-time search algorithm. It is based on the following dynamic programming recursion equations. During the search a table Qjl;jr;eb;et is constructed. Here, Qjl;jr;eb;et denotes the probability of the best hypothesis translating the source words from position jl (left) to position jr (right) which begins with the target language word eb (bottom) and ends with the word et (top). This is illustrated in Fig. 3.</Paragraph>
      <Paragraph position="1"> Here, we initialize this table with monotone translations of IBM Model 4. Therefore, Q0jl;jr;eb;et denotes the probability of the best monotone hypothesis of IBM Model 4. Alternatively, we could use any other single-word based lexicon as well as phrase-based models for this initialization. Our choice is the IBM Model4 to make the results as comparable  rn for different sentence lengths n.</Paragraph>
      <Paragraph position="2"> n 1 ... 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  as possible to the search with the IBM constraints. We introduce a new parameter pm (m^= monotone), which denotes the probability of a monotone combination of two partial hypotheses.</Paragraph>
      <Paragraph position="4"> We formulated this equation for a bigram language model, but of course, the same method can also be applied for a trigram language model. The resulting algorithm is similar to the CYK-parsing algorithm. It has a worst-case complexity of O(J3 C/ E4). Here, J is the length of the source sentence and E is the vocabulary size of the target language.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Pruning
</SectionTitle>
      <Paragraph position="0"> Although the described search algorithm has a polynomial-time complexity, even with a bigram language model the search space is very large. A full search is possible but time consuming. The situation gets even worse when a trigram language model is used. Therefore, pruning techniques are obligatory to reduce the translation time.</Paragraph>
      <Paragraph position="1"> Pruning is applied to hypotheses that translate the same subsequence fjrjl of the source sentence. We use pruning in the following two ways. The first pruning technique is histogram pruning: we restrict the number of translation hypotheses per sequence fjrjl . For each sequence fjrjl , we keep only a fixed number of translation hypotheses. The second pruning technique is threshold pruning: the idea is to remove all hypotheses that have a low probability relative to the best hypothesis. Therefore, we introduce a threshold pruning parameter q, with 0 * q * 1.</Paragraph>
      <Paragraph position="2"> Let Q/jl;jr denote the maximum probability of all translation hypotheses for fjrjl . Then, we prune a hypothesis iff: Qjl;jr;eb;et &lt; q C/Q/jl;jr Applying these pruning techniques the computational costs can be reduced significantly with almost no loss in translation quality.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Generation of Word Graphs
</SectionTitle>
      <Paragraph position="0"> The generation of word graphs for a bottom-top search with the IBM constraints is described in (Ueffing et al., 2002). These methods cannot be applied to the CYK-style search for the ITG constraints. Here, the idea for the generation of word graphs is the following: assuming we already have word graphs for the source sequences fkjl and fjrk+1, then we can construct a word graph for the sequence fjrjl by concatenating the partial word graphs either in monotone or inverted order.</Paragraph>
      <Paragraph position="1"> Now, we describe this idea in a more formal way.</Paragraph>
      <Paragraph position="2"> A word graph is a directed acyclic graph (dag) with one start and one end node. The edges are annotated with target language words or phrases. We also allow +-transitions. These are edges annotated with the empty word. Additionally, edges may be annotated with probabilities of the language or translation model. Each path from start node to end node represents one translation hypothesis. The probability of this hypothesis is calculated by multiplying the probabilities along the path.</Paragraph>
      <Paragraph position="3"> During the search, we have to combine two word graphs in either monotone or inverted order. This is done in the following way: we are given two word graphs w1 and w2 with start and end nodes (s1;g1) and (s2;g2), respectively. First, we add an +-transition (g1;s2) from the end node of the first graph w1 to the start node of the second graph w2 and annotate this edge with the probability of a monotone concatenation pm. Second, we create a copy of each of the original word graphs w1 and w2.</Paragraph>
      <Paragraph position="4"> Then, we add an +-transition (g2;s1) from the end node of the copied second graph to the start node of the copied first graph. This edge is annotated with the probability of a inverted concatenation 1 ! pm.</Paragraph>
      <Paragraph position="5"> Now, we have obtained two word graphs: one for a monotone and one for a inverted concatenation. The final word graphs is constructed by merging the two start nodes and the two end nodes, respectively.</Paragraph>
      <Paragraph position="6"> Let W(jl;jr) denote the word graph for the source sequence fjrjl . This graph is constructed from the word graphs of all subsequences of (jl;jr).</Paragraph>
      <Paragraph position="7"> Therefore, we assume, these word graphs have al- null ready been produced. For all source positions k with jl * k &lt; jr, we combine the word graphs W(jl;k) and W(k + 1;jr) as described above. Finally, we merge all start nodes of these graphs as well as all end nodes. Now, we have obtained the word graph  W(jl;jr) for the source sequence fjrjl . As initialization, we use the word graphs of the monotone IBM4 search.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Extended ITG constraints
</SectionTitle>
      <Paragraph position="0"> In this section, we will extend the ITG constraints described in Sec. 2.1. This extension will go beyond basic reordering constraints.</Paragraph>
      <Paragraph position="1"> We already mentioned that the use of consecutive phrases within the ITG approach is straightforward.</Paragraph>
      <Paragraph position="2"> The only thing we have to change is the initialization of the Q-table. Now, we will extend this idea to phrases that are non-consecutive in the source language. For this purpose, we adopt the view of the ITG constraints as a bilingual grammar as, e.g., in (Wu, 1997). For the baseline ITG constraints, the resulting grammar is: A ! [AA] j hAAi j f=e j f=+ j +=e Here, [AA] denotes a monotone concatenation and hAAi denotes an inverted concatenation.</Paragraph>
      <Paragraph position="3"> Let us now consider the case of a source phrase consisting of two parts f1 and f2. Let e denote the corresponding target phrase. We add the productions A ! [e=f1 A +=f2] j he=f1 A +=f2i to the grammar. The probabilities of these productions are, dependent on the translation direction, p(ejf1;f2) or p(f1;f2je), respectively. Obviously, these productions are not in the normal form of an ITG, but with the method described in (Wu, 1997), they can be normalized.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Corpus Statistics
</SectionTitle>
    <Paragraph position="0"> In the following sections we will present results on two tasks. Therefore, in this section we will show the corpus statistics for each of these tasks.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Verbmobil
</SectionTitle>
      <Paragraph position="0"> The first task we will present results on is the Verbmobil task (Wahlster, 2000). The domain of this corpus is appointment scheduling, travel planning, and hotel reservation. It consists of transcriptions of spontaneous speech. Table 2 shows the corpus statistics of this corpus. The training corpus (Train) was used to train the IBM model parameters. The remaining free parameters, i.e. pm and the model scaling factors (Och and Ney, 2002), were adjusted on the development corpus (Dev). The resulting system was evaluated on the test corpus (Test).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Canadian Hansards
</SectionTitle>
      <Paragraph position="0"> Additionally, we carried out experiments on the Canadian Hansards task. This task contains the proceedings of the Canadian parliament, which are kept by law in both French and English. About 3 million parallel sentences of this bilingual data have been made available by the Linguistic Data Consortium (LDC). Here, we use a subset of the data containing only sentences with a maximum length of 30 words.</Paragraph>
      <Paragraph position="1"> Table 3 shows the training and test corpus statistics.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation in Training
</SectionTitle>
    <Paragraph position="0"> In this section, we will investigate for each of the constraints the coverage of the training corpus alignment. For this purpose, we compute the Viterbi alignment of IBM Model 5 with GIZA++ (Och and Ney, 2000). This alignment is produced without any restrictions on word-reorderings. Then, we check for every sentence if the alignment satisfies each of the constraints. The ratio of the number of satisfied alignments and the total number of sentences is referred to as coverage. Tab. 4 shows the results for the Verbmobil task and for the Canadian Hansards task. It contains the results for both translation directions German-English (S!T) and English-German (T!S) for the Verbmobil task and French-English (S!T) and English-French (T!S) for the Canadian Hansards task, respectively.</Paragraph>
    <Paragraph position="1"> For the Verbmobil task, the baseline ITG constraints and the IBM constraints result in a similar coverage. It is about 91% for the German-English translation direction and about 88% for the English-German translation direction. A significantly higher  coverage of about 96% is obtained with the extended ITG constraints. Thus with the extended ITG constraints, the coverage increases by about 8% absolute. null For the Canadian Hansards task, the baseline ITG constraints yield a worse coverage than the IBM constraints. Especially for the English-French translation direction, the ITG coverage of 73.6% is very low. Again, the extended ITG constraints obtained the best results. Here, the coverage increases from about 87% for the IBM constraints to about 96% for the extended ITG constraints.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Translation Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Evaluation Criteria
</SectionTitle>
      <Paragraph position="0"> In our experiments, we use the following error criteria: null + WER (word error rate): The WER is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the target sentence. + PER (position-independent word error rate): A shortcoming of the WER is the fact that it requires a perfect word order. The PER compares the words in the two sentences ignoring the word order.</Paragraph>
      <Paragraph position="1"> + mWER (multi-reference word error rate): For each test sentence, not only a single reference translation is used, as for the WER, but a whole set of reference translations. For each translation hypothesis, the WER to the most similar sentence is calculated (Niessen et al., 2000).</Paragraph>
      <Paragraph position="2"> + BLEU score: This score measures the precision of unigrams, bigrams, trigrams and fourgrams with respect to a whole set of reference translations with a penalty for too short sentences (Papineni et al., 2001). BLEU measures accuracy, i.e. large BLEU scores are better.</Paragraph>
      <Paragraph position="3"> + SSER (subjective sentence error rate): For a more detailed analysis, subjective judgments by test persons are necessary. Each translated sentence was judged by a human examiner according to an error scale from 0.0 to 1.0 (Niessen et al., 2000).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Translation Results
</SectionTitle>
      <Paragraph position="0"> In this section, we will present the translation results for both the IBM constraints and the baseline ITG constraints. We used a single-word based search with IBM Model 4. The initialization for the ITG constraints was done with monotone IBM Model 4 translations. So, the only difference between the two systems are the reordering constraints.</Paragraph>
      <Paragraph position="1"> In Tab. 5 the results for the Verbmobil task are shown. We see that the results on this task are similar. The search with the ITG constraints yields slightly lower error rates.</Paragraph>
      <Paragraph position="2"> Some translation examples of the Verbmobil task are shown in Tab. 6. We have to keep in mind, that the Verbmobil task consists of transcriptions of spontaneous speech. Therefore, the source sentences as well as the reference translations may have an unorthodox grammatical structure. In the first example, the German verb-group (&amp;quot;w&amp;quot;urde vorschlagen&amp;quot;) is split into two parts. The search with the ITG constraints is able to produce a correct translation. With the IBM constraints, it is not possible to translate this verb-group correctly, because the distance between the two parts is too large (more than four words). As we see in the second example, in German the verb of a subordinate clause is placed at the end (&amp;quot;&amp;quot;ubernachten&amp;quot;). The IBM search is not able to perform the necessary long-range reordering, as it is done with the ITG search.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>