
<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0604">
  <Title>Single-Word Based Approach and Alignment Templates</Title>
  <Section position="3" start_page="0" end_page="20" type="metho">
    <SectionTitle>
1 Statistical Machine Translation
</SectionTitle>
    <Paragraph position="0"> The goal of machine translation is the translation of a text given in some source language into a target language. We are given a source string f/= fl...fj...fJ, which is to be translated into a target string e{ = el...ei...ex. Among all possible target strings, we will choose the string with the highest probability:</Paragraph>
    <Paragraph position="2"> The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language. Pr(e{) is the language model of the target language, whereas Pr (ff~lel I) is the translation model.</Paragraph>
    <Paragraph position="3"> Many statistical translation models (Vogel et al., 1996; Tillmann et al., 1997; Niessen et al., 1998; Brown et al., 1993) try to model word-to-word correspondences between source and target words. The model is often further restricted that each source word is assigned exactly one target word. These alignment models are sireilar to the concept of Hidden Markov models (HMM) in speech recognition. The alignment mapping is j ~ i = aj from source position j to target position i = aj. The use of this alignment model raises major problems as it fails to capture dependencies between groups of words.</Paragraph>
    <Paragraph position="4"> As experiments have shown it is difficult to handle different word order and the translation of compound nouns* In this paper, we will describe two methods for statistical machine translation extending the baseline alignment model in order to account for these problems. In section 2, we shortly review the single-word based approach described in (Tillmann et al., 1997) with some recently iraplemented extensions allowing for one-to-many alignments. In section 3 we describe the alignment template approach which explicitly models shallow phrases and in doing so tries to overcome the above mentioned restrictions of single-word alignments. The described method is an improvement of (Och and Weber, 1998), resulting in an improved training and a faster search organization. The basic idea is to model two different alignment levels: a phrase level alignment between phrases and a word level alignment between single words within these phrases. Similar aims are pursued by (Alshawi et al., 1998; Wang and Waibel, 1998) but differently approached. In section 4 we compare the two methods using the Verbmobil task.</Paragraph>
  </Section>
  <Section position="4" start_page="20" end_page="20" type="metho">
    <SectionTitle>
2 Single-Word Based Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
2.1 Basic Approach
</SectionTitle>
      <Paragraph position="0"> In this section, we shortly review a translation approach based on the so-called monotonicity requirement (Tillmann et al., 1997). Our aim is to provide a basis for comparing the two different translation approaches presented.</Paragraph>
      <Paragraph position="1"> In Eq. (1), Pr(e~) is the language model, which is a trigram language model in this case.</Paragraph>
      <Paragraph position="2"> For the translation model Pr(flJ\[e{) we make the assumption that each source word is aligned to exactly one target word (a relaxation of this assumption is described in section 2.2). For our model, the probability of alignment aj for position j depends on the previous alignment position aj-1 (Vogel et al., 1996). Using this assumption, there are two types of probabilities: the alignment probabilities denoted by p(aj \[aj-1) and the lexicon probabilities denoted by p(fj\[ea~). The string translation probability can be re-written:</Paragraph>
      <Paragraph position="4"> For the training of the above model parameters, we use the maximum likelihood criterion in the so-called maximum approximation. When aligning the words in parallel texts (for Indo-European lar~guage pairs like Spanish-English, French-English, Italian-German,...), we typically observe a strong localization effect. In many cases, although not always, there is an even stronger restriction: over large portions of the source string, the alignment is monotone. In this approach, we first assume that the alignments satisfy the monotonicity requirement. Within the translation search, we will introduce suitably restricted permutations of the source string, to satisfy this requirement. For the alignment model, the monotonicity property allows only transitions from aj-1 to aj with a jump width 5:5 _-- a s -- aj-1 C {0, 1, 2}. Theses jumps correspond to the following three cases</Paragraph>
      <Paragraph position="6"> repetition): This case corresponds to a target word with two or more aligned source words.</Paragraph>
      <Paragraph position="7"> * 5 = 1 (forward transition = regular alignment): This case is the regular one: a single new target word is generated.</Paragraph>
      <Paragraph position="8"> * 5 = 2 (skip transition = non-Migned word): This case corresponds to skipping a word, i.e. there is a word in the target string with no aligned word in the source string.</Paragraph>
      <Paragraph position="9"> The possible alignments using the monotonicity assumption are illustrated in Fig. 1. Monotone alignments are paths through this uniform trellis structure. Using the concept of</Paragraph>
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="5" start_page="20" end_page="21" type="metho">
    <SectionTitle>
[Figure 1: Possible alignments under the monotonicity assumption (axes: source position vs. target position)]
</SectionTitle>
    <Paragraph position="0"> monotone alignments a search procedure can be formulated which is equivalent to finding the best path through a translation lattice, where the following auxiliary quantity is evaluated using dynamic programming: Here, e and e' are Qe,(j, e) probability of the best partial</Paragraph>
    <Paragraph position="2"> the two final words of the hypothesized target string. The auxiliary quantity is evaluated in a position-synchronous way, where j is the processed position in the source string. The result of this search is a mapping: j ~ (aj, ea5 ), where each source word is mapped to a target position aj and a word eaj at this position. For a trigram language model the following DP recursion equation is evaluated:</Paragraph>
    <Paragraph position="4"> p(5) is the alignment probability for the three cases above, p(.\[., .) denoting the trigram language model, e,e~,e&amp;quot;,e m are the four final words which are considered in the dynamic programming taking into account the monotonicity restriction and a trigram language model. The DP equation is evaluated recursively to find the best partial path to each grid point (j, e ~, e). No explicit length model for the length of the generated target string el / given the source string fl J is used during the generation process. The length model is implicitly given by the alignment probabilities. The optimal translation is obtained by carrying out the following optimization: null max{Qe, ( J, e) . p($1e, e')}, el le where J is the length of the input sentence and $ is a symbol denoting the sentence end. The complexity of the algorithm for full search is J-E 4, where E is the size of the target language vocabulary. However, this is drastically reduced by beam-search.</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
2.2 One-to-many alignment model
</SectionTitle>
      <Paragraph position="0"> The baseline alignment model does not permit that a source word is aligned with two or more target words. Therefore, lexical correspondences like 'Zahnarzttermin' for dentist's appointment cause problems because a single source word must be mapped on two or more target words. To solve this problem for the alignment in training, we first reverse the translation direction, i. e. English is now the source language, and German is the target language.</Paragraph>
      <Paragraph position="1"> For this reversed translation direction, we perform the usual training and then check the alignment paths obtained in the maximum approximation. Whenever a German word is aligned with a sequence of the adjacent English words, this sequence is added to the English vocabulary as an additional entry. As a result, we have an extended English vocabulary. Using this new vocabulary, we then perform the standard training for the original translation direction. null</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
2.3 Extension to Handle
Non-Monotonicity
</SectionTitle>
      <Paragraph position="0"> Our approach assumes that the alignment is monotone with respect to the word order for the lion's share of all word alignments. For the translation direction German-English the monotonicity constraint is violated mainly with respect to the verb group. In German, the verb group usually consists of a left and a right verbal brace, whereas in English the words of the verb group usually form a sequence of consecutive words. For our DP search, we use a left-to-right beam-search concept having been introduced in speech recognition, where we rely on beam-search as an efficient pruning technique in order to handle potentially huge search spaces.</Paragraph>
      <Paragraph position="1"> Our ultimate goal is speech translation aiming at a tight integration of speech recognition and translation (Ney, 1999). The results presented were obtained by using a quasi-monotone search procedure, which proceeds from left to right along the position of the source sentence but allows for a small number of source positions that are not processed monotonically. The word re-orderings of the source sentence positions were restricted to the words of the German verb group. Details of this approach will be presented elsewhere.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="21" end_page="25" type="metho">
    <SectionTitle>
3 Alignment Template Approach
</SectionTitle>
    <Paragraph position="0"> A general deficiency of the baseline alignment models is that they are only able to model correspondences between single words. A first countermeasure was the refined alignment model described in section 2.2. A more systematic approach is to consider whole phrases rather than single words as the basis for the alignment models. In other words, a whole group of adjacent words in the source sentence may be aligned with a whole group of adjacent words in the target language. As a result the context of words has a greater influence and the changes in word order from source to target language can be learned explicitly.</Paragraph>
    <Section position="1" start_page="21" end_page="23" type="sub_section">
      <SectionTitle>
3.1 The word level alignment: alignment templates
</SectionTitle>
      <Paragraph position="0"> In this section we will describe how we model the translation of shallow phrases.</Paragraph>
      <Paragraph position="2"> Ti: zwei, drei, vier, ffinf, ...</Paragraph>
      <Paragraph position="3">  T2: Uhr T3: vormittags, nachmittags, abends, $1: two, three, four, five .... $2: o'clock $3: in S4: the S5: morning, evening, afternoon, ...  and bilingual word classes. The key element of our translation model are the alignment tempJa(es. An alignment template z is a triple (F, E, A) which describes the alignment A between a source class sequence and a target class sequence E. The alignment A is represented as a matrix with binary values. A matrix element&amp;quot; with value 1 means that the words at the corresponding positions are aligned and the value 0 means that the words are not aligned. If a source word is not aligned to a target word then it is aligned to the empty word e0 which shall be at the imaginary position i = 0. This alignment representation is a generalization of the baseline alignments described in (Brown et al., 1993) and allows for many-to-many alignments.</Paragraph>
      <Paragraph position="4"> The classes used in F and E are automatically trained bilingual classes using the method described in (Och, 1999) and constitute a partition of the vocabulary of source and target language. The class functions .T and E map words to their classes. The use of classes instead of words themselves has the advantage of a better generalization. If there exist classes in source and target language which contain all towns it is possible that an alignment template learned using a special town can be generalized to all towns. In Fig. 2 an example of an alignment template is shown.</Paragraph>
      <Paragraph position="5"> An alignment template z = (F, E, A) is applicable to a sequence of source words \] if the alignment template classes and the classes of the source words are equal: .T(\]) = F. The application of the alignment template z constrains the target words ~ to coffrespond to the target class sequence: E(~) = E.</Paragraph>
      <Paragraph position="6"> The application of an alignment template does not determine the target words, but only constrains them. For the selection of words from classes we use a statistical model for p(SIz,/) based on the lexicon probabilities of a statistical lexicon p(f\[e). We assume a mixture alignment between the source and target language words constrained by the alignment matrix A:</Paragraph>
      <Paragraph position="8"> The phrase level alignment In order to describe the phrase level alignment in a formal way, we first decompose both the source sentence fl J and the target sentence el / into a sequence of phrases (k = 1,..., K): fx g = /1 ~ , fk = fjk-x+l,'&amp;quot;,fjk ef ---- el K , ek ---- eik_l+l,...,eik In order to simplify the notation and the presentation, we ignore the fact that there can be a large number of possible segmentations and assume that there is only one segmentation. In the previous section, we have described the alignment within the phrases. For the alignment 5~&amp;quot; * between the source phrases ~1K and the target phrases/~, we obtain the following equation:</Paragraph>
      <Paragraph position="10"> For the phrase level alignment we use a first-order alignment model p(Sklgl k-~,K) = p(SklSk_l, K) which is in addition constrained to be a permutation of the K phrases.</Paragraph>
      <Paragraph position="11"> For the translation of one phrase, we introduce the alignment template as an unknown variable:</Paragraph>
      <Paragraph position="13"> The probability p(zl~ ) to apply an alignment template gets estimated by relative frequencies (see next section). The probability p(flz, ~) is decomposed by Eq. (2).</Paragraph>
    </Section>
    <Section position="2" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
3.3 Training
</SectionTitle>
      <Paragraph position="0"> In this section we show how we obtain the parameters of our translation model by using a parallel training corpus: 1. We train two HMM alignment models (Vogel et al., 1996) for the two translation directions f ~ e and e ~ f by applying the EM-algorithm. However we do not apply maximum approximation in training, thereby obtaining slightly improved alignments. null 2. For each translation direction we calculate the Viterbi-alignment of the translation models determined in the previous step. Thus we get two alignment vectors al J and bl / for each sentence.</Paragraph>
      <Paragraph position="1"> We increase the quality of the alignments by combining the two alignment vectors into one alignment matrix using the following method. A1 = {(aj,j)\[j = 1... J} and A2 = {(i, bi)li = 1... I} denote the set of links in the two Viterbi-alignments. In a first step the intersection A = A1 n A2 is determined. The elements within A are justified by both Viterbi-alignments and are therefore very reliable. We now extend the alignment A iteratively by adding links (i, j) occurring only in A1 or in A2 if they have a neighbouring link already in A or if neither the word fj nor the word ei are aligned in A. The alignment (i, j) has the neighbouring links (i - 1,j), (i,j - 1), (i + 1, j), and (i, j + 1). In the Verbmobil task (Table 1) the precision of the baseline Viterbi alignments is 83.3 percent with English as source language and 81.8 percent with German as source language. Using this heuristic we get an alignment matrix with a precision of 88.4 percent without loss in recall.</Paragraph>
      <Paragraph position="2"> 3. We estimate a bilingual word lexicon p(fle) by the relative frequencies of the alignment determined in the previous step: p(fle) _ hA(f, e) (6) n(e) Here nA(f,e) is the frequency that the word f is aligned to e and n(e) is the frequency of e in the training corpus.</Paragraph>
      <Paragraph position="3"> 4. We determine word classes for source and target language. A naive approach for doing this would be the use of monolingually optimized word classes in source and target language. Unfortunately we can not expect that there is a direct correspondence between independently optimized classes.</Paragraph>
      <Paragraph position="4"> Therefore monolingually optimized word classes do not seem to be useful for machine translation.</Paragraph>
      <Paragraph position="5"> We determine correlated bilingual classes by using the method described in (Och, 1999). The basic idea of this method is to apply a maximum-likelihood approach to the joint probability of the parallel training corpus. The resulting optimization criterion for the bilingual word classes is similar to the one used in monolingual maximum-likelihood word clustering.</Paragraph>
      <Paragraph position="6"> 5. We count all phrase-pairs of the training corpus which are consistent with the alignment matrix determined in step 2. A phrase-pair is consistent with the alignment if the words within the source phrase are only aligned to words within the target phrase. Thus we obtain a count n(z) of how often an alignment template occurred in the aligned training corpus. The probability of using an alignment template needed by Eq. (5) is estimated by relative frequency:</Paragraph>
      <Paragraph position="8"> Fig. 3 shows some of the extracted alignment templates. The extraction algorithm  th. ........... I &amp;quot;1&amp;quot; in ........... I &amp;quot;1&amp;quot; o.oloo ........... l&amp;quot; I&amp;quot; two ........... 1* l&amp;quot; maybe .........</Paragraph>
      <Paragraph position="10"> does not perform a selection of good or bad alignment templates - it simply extracts all possible alignment templates.</Paragraph>
    </Section>
    <Section position="3" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
3.4 Search
</SectionTitle>
      <Paragraph position="0"> For decoding we use the following search criterion: null arg max {p(e~).p(e~lf~)) (8) 4 This decision rule is an approximation to Eq. (1) which would use the translation probability p(flJle{). Using the simplification it is easy to integrate translation and language model in the search process as both models predict target words. As experiments have shown this simplification does not affect the quality of translation results.</Paragraph>
      <Paragraph position="1"> To allow the influence of long contexts we use a class-based five-gram language model with backing-off.</Paragraph>
      <Paragraph position="2"> The search space denoted by Eq. (8) is very large. Therefore we apply two preprocessing steps before the translation of a sentence: 1. We determine the set of all source phrases in f for which an applicable alignment template exists. Every possible application of an alignment template to a sub-sequence of the source sentence is called alignment template instantiation. 2. We now perform a segmentation of the input sentence. We search for a sequence of phrases fl o...o/k = fl J with:</Paragraph>
      <Paragraph position="4"> arg max II maxz p(zlfk ) (9) \]lO...oh=:: k=l This is done efficiently by dynamic programming. Because of the simplified decision rule (Eq. (8)) it is used in Eq. (9) p(z\]fk) instead of p(z\]~k).</Paragraph>
      <Paragraph position="5"> Afterwards the actual translation process begins. It has a search organization along the positions of the target language string. In search we produce partial hypotheses, each of which contains the following information:  1. the last target word produced, 2. the state of the language model (the classes of the last four target words), 3. a bit-vector representing the already covered positions of the source sentence, 4. a reference to the alignment template instantiation which produced the last target word, 5. the position of the last target word in the alignment template instantiation, 6. the accumulated costs (the negative loga null rithm of the probabilities) of all previous decisions, 7. a reference to the previous partial hypothesis. null A partial hypothesis is extended by appending one target word. The set of all partial hypotheses can be structured as a graph with a source node representing the sentence start, leaf nodes representing full translations and intermediate nodes representing partial hypotheses. We recombine partial hypotheses which cannot be distinguished by neither language model nor translation model. When the elements 1 - 5 of two partial hypotheses do not allow to distinguish between two hypotheses it is possible to drop the hypothesis with higher costs for the subsequent search process.</Paragraph>
      <Paragraph position="6"> We also use beam-search in order to handle the huge search space. We compare in beam-search hypotheses which cover different parts of  the input sentence. This makes the comparison of the costs somewhat problematic. Therefore we integrate an (optimistic) estimation of the remaining costs to arrive at a full translation. This can be done efficiently by determining in advance for each word in the source language sentence a lower bound for the costs of the translation of this word. Together with the bit-vector stored in a partial hypothesis it is possible to achieve an efficient estimation of the remaining costs.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="25" end_page="34465" type="metho">
    <SectionTitle>
4 Translation results
</SectionTitle>
    <Paragraph position="0"> The &amp;quot;Verbmobil Task&amp;quot; (Wahlster, 1993) is a speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation. The task is difficult because it consists of spontaneous speech and the syntactic structures of the sentences are less restricted and highly variable.</Paragraph>
    <Paragraph position="1"> The translation direction is from German to English which poses special problems due to the big difference in the word order of the two languages. We present results on both the text transcription and the speech recognizer output using the alignment template approach and the single-word based approach.</Paragraph>
    <Paragraph position="2"> The text input was obtained by manually transcribing the spontaneously spoken sentences. There was no constraint on the length of the sentences, and some of the sentences in the test corpus contain more than 50 words. Therefore, for text input, each sentence is split into shorter units using the punctuation marks. The segments thus obtained were translated separately, and the final translation was obtained by concatenation.</Paragraph>
    <Paragraph position="3"> In the case of speech input, the speech recognizer along with a prosodic module produced so-called prosodic markers which are equivalent to punctuation marks in written language. The experiments for speech input were performed on the single-best sentence of the recognizer. The recognizer had a word error rate of 31.0%. Considering only the real words without the punctuation marks, the word error rate was smaller, namely 20.3%.</Paragraph>
    <Paragraph position="4"> A summary of the corpus used in the experiments is given in Table 1. Here the term word refers to full-form word as there is no morphological processing involved. In some of our experiments we use a domain-specific preprocessing which consists of a list of 803 (for German) and 458 (for English) word-joinings and wordsplittings for word compounds, numbers, dates and proper names. To improve the lexicon probabilities and to account for unseen words we added a manually created German-English dictionary with 13 388 entries. The classes used were constrained so that all proper names were included in a single class. Apart from this, the classes were automatically trained using the described bilingual clustering method. For each of the two languages 400 classes were used.</Paragraph>
    <Paragraph position="5"> For the single-word based approach, we used the manual dictionary as well as the preprocessing steps described above. Neither the translation model nor the language model used classes in this case. In principal, when re-ordering words of the source string, words of the German verb group could be moved over punctuation marks, although it was penalized by a constant cost.</Paragraph>
    <Paragraph position="6">  Verbmobil task. The extended vocabulary includes the words of the manual dictionary. The trigram perplexity (PP) is given.</Paragraph>
    <Paragraph position="7">  The WER is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated string into the target string. This performance criterion is</Paragraph>
    <Paragraph position="9"> rate): A shortcoming of the WER is the fact that it requires a perfect word order. This is  independent word error rate (PER) and subjective sentence error rate (SSER) with/without pre-processing (147 sentences = 1 968 words of the Verbmobil task).  particularly a problem for the Verbmobil task, where the word order of the German-English sentence pair can be quite different. As a result, the word order of the automatically generated target sentence can be different from that of the target sentence, but nevertheless acceptable so that the WER measure alone could be misleading. In order to overcome this problem, we introduce as additional measure the position-independent word error rate (PER). This measure compares the words in the two sentences without taking the word order into account. Words that have no matching counterparts are counted as substitution errors. Depending on whether the translated sentence is longer or shorter than the target translation, the remaining words result in either insertion or deletion errors in addition to substitution errors. The PER is guaranteed to be less than or equal to the WER.</Paragraph>
    <Paragraph position="10"> SSER (subjective sentence error rate): For a more detailed analysis, subjective judgments by test persons are necessary.</Paragraph>
    <Paragraph position="11"> Each translated sentence was judged by a human examiner according to an error scale from 0.0 to 1.0. A score of 0.0 means that the translation is semantically and syntactically correct, a score of 0.5 means that a sentence is semantically correct but syntactically wrong and a score of 1.0 means that the sent6nce is semantically wrong. The human examiner was offered the translated sentences of the two approaches at the same time. As a result we expect a better possibility of reproduction.</Paragraph>
    <Paragraph position="12"> The results of the translation experiments using the single-word based approach and the alignment template approach on text input and on speech input are summarized in Table 2. The results are shown with and without the use of domain-specific preprocessing. The alignment template approach produces better translation results than the single-word based approach.</Paragraph>
    <Paragraph position="13"> From this we draw the conclusion that it is important to model word groups in source and target language. Considering the recognition word error rate of 31% the degradation of about 20% by speech input can be expected. The average translation time on an Alpha workstation for a single sentence is about one second for the alignment template apprbach and 30 seconds for the single-word based search procedure.</Paragraph>
    <Paragraph position="14"> Within the Verbmobil project other translation modules based on rule-based, example-based and dialogue-act-based translation are used. We are not able to present results with these methods using our test corpus. But in the current Verbmobil prototype the preliminary evaluations show that the statistical methods produce comparable or better results than the other systems. An advantage of the system is that it is robust and always produces a translation result even if the input of the speech recognizer is quite incorrect.</Paragraph>
  </Section>
  <Section position="8" start_page="34465" end_page="34465" type="metho">
    <SectionTitle>
5 Summary
</SectionTitle>
    <Paragraph position="0"> We have described two approaches to perform statistical machine translation which extend the baseline alignment models. The single-word  based approach allows for the the possibility of one-to-many alignments. The alignment template approach uses two different alignment levels: a phrase level alignment between phrases and a word level alignment between single words. As a result the context of words has a greater influence and the changes in word order from source to target language can be learned explicitly. An advantage of both methods is that they learn fully automatically by using a bilingual training corpus and are capable of achieving better translation results on a limited-domain task than other example-based or rule-based translation systems.</Paragraph>
  </Section>
class="xml-element"></Paper>