<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2158"> <Title>A DP based Search Algorithm for Statistical Machine Translation</Title> <Section position="3" start_page="961" end_page="963" type="metho"> <SectionTitle> 2 DP Search </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="961" end_page="961" type="sub_section"> <SectionTitle> 2.1 The Inverted Alignment Model </SectionTitle> <Paragraph position="0"> For our search method, we chose an algorithm which is based on dynamic programming. Compared to an A'-based algorithm dynamic programming has the fundamental advantage, that solutions of subproblems are stored and can then be re-used in later stages of the search process. However, for the optimization criterion considered here dynamic programming is only suboptimal because the decomposition into independent subproblems is only approximately possible: to prevent the search time of a search algorithm from increasing exponentially with the string lengths and vocabulary sizes, local decisions have to be made at an earlier stage of the optimization process that might turn out to be suboptimal in a later stage but cannot be altered then. As a consequence, the global optimum might be missed in some cases.</Paragraph> <Paragraph position="1"> The search algorithm we present here combines the advantages of dynamic programming with the search organization along the positions of the target string, which allows the integration of the bigram in a very natural way without restricting the alignment paths to the class of monotone alignments.</Paragraph> <Paragraph position="2"> The alignment model as described above is defined as a function that assigns exactly one target word to each source word. We introduce a new interpretation of the alignment model: Each position i in e / is assigned a position bi = j in fl J. Fig. 2 illustrates the possible transitions in this inverted model.</Paragraph> <Paragraph position="3"> At each position i of el, each word of the target language vocabulary can be inserted. In addition, the fertility l must be chosen: A position i and the word ei at this position are considered to correspond to a sequence of words f~:+1-t in f\]. In most cases, the optimal fertility is 1. It is also possible, that a word ei has fertility 0, which means that there is no directly corresponding word in the source string. We call this a skip, because the position i is skipped in the alignment path.</Paragraph> <Paragraph position="4"> Using a bigram language model, Eq. (9) specifies the modified search criterion for our algorithm. Here as above, we assume the maximum approximation to be valid.</Paragraph> <Paragraph position="6"/> <Paragraph position="8"> For better legibility, we regard the second product in Eq. (9) to be equal to 1, ifl = 0. It should be stressed that the pair (I,e{) optimizing Eq. (9) is not guaranteed to be also optimal in terms of the original criterion (6).</Paragraph> </Section> <Section position="2" start_page="961" end_page="961" type="sub_section"> <SectionTitle> 2.2 Basic Problem: Position Coverage </SectionTitle> <Paragraph position="0"> .4. closer look at Eq. (9) reveals the most important problem of the search organization along the target string positions: It is not guaranteed, that all the words in the source string are considered. In other words we have to force the algorithm to cover all input string positions. 
<Section position="2" start_page="961" end_page="961" type="sub_section"> <SectionTitle> 2.2 Basic Problem: Position Coverage </SectionTitle> <Paragraph position="0"> A closer look at Eq. (9) reveals the most important problem of the search organization along the target string positions: it is not guaranteed that all the words in the source string are considered. In other words, we have to force the algorithm to cover all input string positions. Different strategies to solve this problem are possible: for example, we can introduce a reward for covering a position which has not yet been covered, or a penalty can be imposed for each position without correspondence in the target string.</Paragraph> <Paragraph position="1"> In preliminary experiments, we found that the most promising method to satisfy the position coverage constraint is the introduction of an additional parameter into the recursion formula for DP. In the following, we will explain this method in detail.</Paragraph> </Section> <Section position="3" start_page="961" end_page="962" type="sub_section"> <SectionTitle> 2.3 Recursion Formula for DP </SectionTitle> <Paragraph position="0"> In the DP formalism, the search process is described recursively. Assuming a total length I of the target string, Q_I(c, i, j, e) is the probability of the best partial path ending in the coordinates i in e_1^I and j in f_1^J, if the last word e_i is e and if c positions in the source string have been covered.</Paragraph> <Paragraph position="1"> This quantity is defined recursively. Leaving a word e_i without any assignment (skip) is the easiest case: Q_I^S(c, i, j, e) = max_{e'} { p(e|e') * Q_I(c, i-1, j, e') } . Note that it is not necessary to maximize over the predecessor positions j': this maximization is subsumed by the maximization over the positions on the next level, as can easily be proved.</Paragraph> <Paragraph position="2"> In the original criterion (6), each position j in the source string is aligned to exactly one target string position i. Hence, if i is assigned to l subsequent positions in f_1^J, we want to verify that none of these positions has already been covered. We define a control function v which returns 1 if this constraint is satisfied and 0 otherwise. Then we can write:</Paragraph> <Paragraph position="4"> We now have to find the maximum: Q_I(c, i, j, e) = max { Q_I^S(c, i, j, e), Q_I^N(c, i, j, e) } , where Q_I^N denotes the non-skip case. The decisions made during the dynamic programming process (choices of l, j' and e') are stored for recovering the whole translation hypothesis.</Paragraph> <Paragraph position="5"> The best translation hypothesis can be found by optimizing over the target string length I and requiring the number of covered positions to be equal to the source string length J:</Paragraph> <Paragraph position="7"/> </Section>
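The recursion can be illustrated with the following simplified, hypothetical sketch. It is not the authors' implementation: the exact non-skip recursion is not reproduced in this extract, so it is approximated here by a product of lexicon probabilities over the l newly covered source words, and the control function v (no source position covered twice) is omitted. The names dp_search, relax, p_lm, p_lex, max_fertility and the sentence-start symbol "<s>" are assumptions made for the sketch.

```python
# Simplified sketch of the DP over grid points (c, i, j, e):
# c = number of covered source positions, i = target position,
# j = last covered source position, e = last target word.
# p_lm(e, e_prev) stands for the bigram p(e|e'); p_lex(f, e) for p(f|e).

def dp_search(source, vocab, p_lm, p_lex, I, max_fertility=2):
    J = len(source)
    Q = {(0, 0, 0, "<s>"): 1.0}           # Q[(c, i, j, e)] = best partial score
    back = {}                             # backpointers for recovering e_1^I
    for i in range(1, I + 1):
        newQ = {}
        for (c, _, j, e_prev), score in Q.items():
            for e in vocab:
                lm = p_lm(e, e_prev)
                # skip case (fertility 0): e covers no source word
                relax(newQ, back, (c, i, j, e), lm * score,
                      (c, i - 1, j, e_prev, 0))
                # non-skip case: e covers l consecutive source words ending at j_new
                for l in range(1, max_fertility + 1):
                    for j_new in range(l, J + 1):
                        lex = 1.0
                        for f in source[j_new - l:j_new]:
                            lex *= p_lex(f, e)
                        relax(newQ, back, (c + l, i, j_new, e),
                              lm * lex * score, (c, i - 1, j, e_prev, l))
        Q = newQ
    # position coverage constraint: all J source positions must be covered
    final = [k for k in Q if k[0] == J]
    best = max(final, key=lambda k: Q[k]) if final else None
    return best, (Q[best] if best else 0.0), back

def relax(table, back, key, score, pointer):
    # keep only the best-scoring partial path per grid point
    if score > table.get(key, 0.0):
        table[key] = score
        back[key] = pointer
```

The acceleration techniques of the next subsection would act inside these loops: the coverage window prunes grid points whose counter c falls outside C_min(i) and C_max(i), and the two thresholds prune candidate words e before the inner loops are entered.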
<Section position="4" start_page="962" end_page="963" type="sub_section"> <SectionTitle> 2.4 Acceleration Techniques </SectionTitle> <Paragraph position="0"> The time complexity of the translation method as described above is</Paragraph> <Paragraph position="2"> where |E| is the size of the target language vocabulary E. Some refinements of this algorithm have been implemented to increase the translation speed.</Paragraph> <Paragraph position="3"> 1. We can expect the progression of the source string coverage to be roughly proportional to the progression of the translation procedure along the target string. So it is legitimate to define a minimal and a maximal coverage for each level i: C_min(i) = ⌊iJ/I⌋ - r, C_max(i) = ⌈iJ/I⌉ + r, where r is a constant integer. In preliminary experiments we found that we could set r to 3 without any loss in translation accuracy.</Paragraph> <Paragraph position="4"> This reduces the time complexity by a factor J.</Paragraph> <Paragraph position="5"> 2. Optimizing the target string length as formulated in Eq. (10) requires the dynamic programming procedure to start all over again for each I. If we assume the dependence of the alignment probabilities p(i|j, J, I) on I to be negligible, we can renormalize them by using an estimated target string length Î and use p(i|j, J, Î). Now we can produce one translation e_1^I for each target string length I in a single run.</Paragraph> <Paragraph position="7"> For Î we choose: Î = Î(J) = J · μ_I / μ_J, where μ_I and μ_J denote the average lengths of the target and source strings, respectively.</Paragraph> <Paragraph position="8"> This approximation is partly undone by what we call rescoring: for each translation hypothesis e_1^I with length I, we compute the &quot;true&quot; score Q(I) by searching the best inverted alignment given e_1^I and f_1^J and evaluating the probabilities along this alignment. Hence, we finally find the best translation via Eq. (12):</Paragraph> <Paragraph position="10"> The time complexity for this additional step is negligible, since there is no optimization over the English words, which is the dominant factor in the overall time complexity.</Paragraph> <Paragraph position="12"> 3. We introduced two thresholds: S_L: If e' is the predecessor word of e and e is not aligned to the source string (&quot;skip&quot;), then p(e|e') must be higher than S_L.</Paragraph> <Paragraph position="13"> S_T: A word e can only be associated with a source language word f if p(f|e) is higher than S_T.</Paragraph> <Paragraph position="14"> This restricts the optimization over the target language vocabulary to a relatively small set of candidate words. The resulting time complexity is O(I_max · J^2 · |E|).</Paragraph> <Paragraph position="15"> 4. When searching for the best partial path to a grid point G = (c, i, j, e), we can sort the arcs leading to G in a specific manner that allows us to stop the computation whenever it becomes clear that no better partial path to G exists. The effect of this measure depends on the quality of the models used; in preliminary experiments we observed a speed-up factor of about 3.5.</Paragraph> </Section> </Section> <Section position="4" start_page="963" end_page="965" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> The search algorithm suggested in this paper was tested on the Verbmobil corpus. The results of preliminary tests on a small, automatically generated corpus (Amengual et al., 1996) were quite promising and encouraged us to apply our search algorithm to a more realistic task.</Paragraph> <Paragraph position="1"> The Verbmobil corpus consists of spontaneously spoken dialogs in the domain of appointment scheduling (Wahlster, 1993). German source sentences are translated into English. In Table 1 the characteristics of the training and test sets are summarized. The vocabularies include category labels for dates, proper names, numbers, times, names of places and spellings. The model parameters were trained on 16 296 sentence pairs, where names etc. had been replaced by the appropriate labels.</Paragraph> <Paragraph position="2"> Table 1: Characteristics of the training and test sets for the Verbmobil task.</Paragraph> <Paragraph position="4"> In preliminary evaluations, optimal values for the thresholds S_L and S_T had been determined and kept fixed during the experiments.</Paragraph> <Paragraph position="5"> As an automatic and easy-to-use measure of the translation performance, the Levenshtein distance between the produced translations and the sample translations was calculated. The translation results are summarized in Table 2.</Paragraph>
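For reference, a minimal sketch (not the authors' code) of the word-level Levenshtein distance used as the automatic measure: it counts the minimum number of substitutions, insertions and deletions needed to turn the produced translation into the sample translation.

```python
# Word-level Levenshtein (edit) distance between a produced translation
# and a sample translation; illustrative sketch, not the authors' code.
def levenshtein(hyp, ref):
    hyp, ref = hyp.split(), ref.split()
    # d[i][j] = edit distance between hyp[:i] and ref[:j]
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(hyp)][len(ref)]

# e.g. "thank you" against the sample translation "thank you very much" -> distance 2
print(levenshtein("thank you", "thank you very much"))  # 2
```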
<Paragraph position="6"> Given the vocabulary sizes, it becomes quite obvious that the lexicon probabilities p(f|e) cannot be trained sufficiently on only 16 296 sentence pairs. The fact that about 40% of the words in the lexicon are seen only once in training illustrates this. To improve the lexicon probabilities, we interpolated them with lexicon probabilities p_M(f|e) manually created from a German-English dictionary: p_M(f|e) = 1/N_e if (e, f) is in the dictionary and 0 otherwise, where N_e is the number of German words listed as translations of the English word e. The two lexica were combined by linear interpolation with the interpolation parameter λ. For our first experiments, we set λ to 0.5.</Paragraph>
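A small illustrative sketch of this linear interpolation follows; the dictionary entries and probabilities are made up, and the text does not specify which lexicon receives the weight λ versus 1 - λ (with λ = 0.5 the choice is immaterial).

```python
# Linear interpolation of a trained lexicon p(f|e) with a manual dictionary
# lexicon p_M(f|e) = 1/N_e, where N_e is the number of German words listed
# as translations of the English word e. All data below is invented.
LAMBDA = 0.5

manual_dict = {"thanks": ["danke", "dankeschön"]}            # e -> German translations
trained_lex = {("danke", "thanks"): 0.7, ("bitte", "thanks"): 0.3}

def p_manual(f, e):
    entries = manual_dict.get(e, [])
    return 1.0 / len(entries) if f in entries else 0.0

def p_interpolated(f, e):
    p_trained = trained_lex.get((f, e), 0.0)
    return LAMBDA * p_trained + (1.0 - LAMBDA) * p_manual(f, e)

print(p_interpolated("danke", "thanks"))   # 0.5 * 0.7 + 0.5 * 0.5 = 0.6
```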
<Paragraph position="7"> The test corpus consisted of 150 sentences, for which sample translations exist. The labels were translated separately: first, the test sentences were preprocessed in order to replace words or groups of words by the correct category label. Then, our search algorithm translated the transformed sentences. In the last step, a simple rule-based algorithm replaced the category labels by the translations of the original words.</Paragraph> <Paragraph position="8"> We used a bigram language model for the English language. Its perplexity on the corpus of transformed sample translations (i.e. after labelling) was 13.8. (Tillmann et al., 1997a) report a word error rate of 51.8% on similar data.</Paragraph> <Paragraph position="9"> Although the Levenshtein distance has the great advantage of being automatically computable, we have to keep in mind that it depends fundamentally on the choice of the sample translation. For example, each of the expressions &quot;thanks&quot;, &quot;thank you&quot; and &quot;thank you very much&quot; is a legitimate translation of the German &quot;danke schön&quot;, but when calculating the Levenshtein distance to a sample translation, at least two of them will produce word errors. The more words the vocabulary contains, the more important the problem of synonyms becomes.</Paragraph> <Paragraph position="10"> This is why we also asked five experts to classify the produced translations independently into three categories, the same as in (Wang and Waibel, 1997): Correct translations are grammatical and convey the same meaning as the input.</Paragraph> <Paragraph position="11"> Acceptable translations convey the same meaning but with small grammatical mistakes, or they convey most but not the entire meaning of the input.</Paragraph> <Paragraph position="12"> Incorrect translations are ungrammatical or convey little meaningful information or the information is different from the input.</Paragraph> <Paragraph position="13"> Examples for each category are given in Table 3. Table 4 shows the statistics of the translation performance. When different judgements existed for one sentence, the majority vote was accepted.</Paragraph> <Paragraph position="14"> For the calculation of the subjective sentence error rate (SSER), translations from the second category counted as &quot;half-correct&quot;.</Paragraph> <Paragraph position="15"> Table 3 (examples from the test data): Input: Ja, also mit Dienstag und mittwochs und so hätte ich Zeit, aber Montag kommen wir hier nicht weg aus Kiel. Output: Yes, and including on Tuesday and Wednesday as well, I have time on Monday but we will come to be away from Kiel. Input: Dann fahren wir da los. Output: We go out.</Paragraph> <Paragraph position="16"> Table 4: Translation performance on Verbmobil: number of sentences evaluated as Correct (C), Acceptable (A) or Incorrect (I). For the total percentage of non-correct translations (SSER), the &quot;acceptable&quot; translations are counted as half-errors. Total: 150, Correct: 61, Acceptable: 45, Incorrect: 44, SSER: 44.3% (= (44 + 45/2) / 150).</Paragraph> <Paragraph position="17"> When evaluating the performance of a statistical machine translator, we would like to distinguish errors due to the weakness of the underlying models from search errors, occurring whenever the search algorithm misses a translation hypothesis with a higher score. Unfortunately, we can never be sure that a search error does not occur, because we do not know whether or not there is another string with an even higher score than the produced output.</Paragraph> <Paragraph position="18"> Nevertheless, it is quite interesting to compare the score of the algorithm's output and the score of the sample translation in those cases in which the output is not correct (i.e. it is classified as &quot;acceptable&quot; or &quot;incorrect&quot;).</Paragraph> <Paragraph position="19"> The original value to be maximized by the search algorithm (see Eq. (6)) is the score as defined by the underlying models and described by Eq. (13):</Paragraph> <Paragraph position="20"> Pr(e_1^I) * p(J|I) * prod_{j=1..J} [ max_{i in [1,I]} p(i|j, J, I) * p(f_j|e_i) ]   (13)</Paragraph> <Paragraph position="21"> We calculated this score for the sample translations as well as for the automatically generated translations. Table 5 shows the result of the comparison. In most cases, the incorrect outputs have higher scores than the sample translations, which leads to the conclusion that improving the models (a stronger language model for the target language, a better translation model and especially more training data) will have a strong impact on the quality of the produced translations. The other cases, i.e. those in which the models prefer the sample translations to the produced output, might be due to the difference between the original search criterion (6) and the criterion (9), which is the basis of our search algorithm. The approximation introduced by the parameters S_T and S_L is an additional reason for search errors.</Paragraph> <Paragraph position="22"> Table 5: Comparison of the score of the reference translation e and the translator output e' for &quot;acceptable&quot; translations (A) and &quot;incorrect&quot; translations (I). For the total number of non-correct translations (T), the &quot;acceptable&quot; translations are counted as half-errors.</Paragraph>
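A minimal sketch of this score comparison is shown below; the code is hypothetical, the probability functions (p_bigram, p_length, p_align, p_lex) are caller-supplied placeholders, and the language model contribution is reduced to a plain bigram product.

```python
import math

def model_score(target, source, p_bigram, p_length, p_align, p_lex):
    """Log of the Eq. (13) score:
    Pr(e_1^I) * p(J|I) * prod_j [ max_i p(i|j,J,I) * p(f_j|e_i) ].
    All probability functions must be smoothed so no factor is exactly zero."""
    I, J = len(target), len(source)
    score = math.log(p_length(J, I))          # length model p(J|I)
    prev = "<s>"
    for e in target:                          # bigram language model Pr(e_1^I)
        score += math.log(p_bigram(e, prev))
        prev = e
    for j, f in enumerate(source, start=1):   # best target position for each f_j
        best = max(p_align(i, j, J, I) * p_lex(f, target[i - 1])
                   for i in range(1, I + 1))
        score += math.log(best)
    return score

# Comparing model_score(sample_translation, ...) with model_score(output, ...)
# indicates whether a non-correct output is a model error (output scores higher)
# or a search error (the sample translation scores higher).
```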
<Paragraph position="23"> As far as we know, only two recent papers have dealt with the decoding problem for machine translation systems that use translation models based on hidden alignments without a monotonicity constraint: (Berger et al., 1994) and (Wang and Waibel, 1997). The former uses data sets that differ significantly from the Verbmobil task and hence the reported results cannot be compared to ours. The latter presents experiments carried out on a corpus comparable to our test data in terms of vocabulary sizes, domain and number of test sentences. The authors report a subjective sentence error rate which is in the same range as ours. An exact comparison is only possible if exactly the same training and testing data are used and if all the details of the search algorithms are considered.</Paragraph> </Section> </Paper>