Improved Alignment Models for Statistical Machine Translation 
Franz Josef Och, Christoph Tillmann, and Hermann Ney 
Lehrstuhl für Informatik VI
RWTH Aachen - University of Technology
Ahornstraße 55, 52056 Aachen, GERMANY
{och,tillmann,ney}@informatik.rwth-aachen.de
Abstract 
In this paper, we describe improved alignment 
models for statistical machine translation. The 
statistical translation approach uses two types 
of information: a translation model and a lan- 
guage model. The language model used is a 
bigram or general m-gram model. The transla- 
tion model is decomposed into a lexical and an 
alignment model. We describe two different ap- 
proaches for statistical translation and present 
experimental results. The first approach is 
based on dependencies between single words, 
the second approach explicitly takes shallow 
phrase structures into account, using two differ- 
ent alignment levels: a phrase level alignment 
between phrases and a word level alignment 
between single words. We present results us- 
ing the Verbmobil task (German-English, 6000- 
word vocabulary) which is a limited-domain 
spoken-language task. The experimental tests 
were performed on both the text transcription 
and the speech recognizer output. 
1 Statistical Machine Translation 
The goal of machine translation is the transla- 
tion of a text given in some source language into 
a target language. We are given a source string 
$f_1^J = f_1 \ldots f_j \ldots f_J$, which is to be translated into a target string $e_1^I = e_1 \ldots e_i \ldots e_I$. Among all possible target strings, we will choose the string with the highest probability:

$$\hat{e}_1^I = \arg\max_{e_1^I} \{\Pr(e_1^I|f_1^J)\} = \arg\max_{e_1^I} \{\Pr(e_1^I) \cdot \Pr(f_1^J|e_1^I)\} \quad (1)$$
The argmax operation denotes the search prob- 
lem, i.e. the generation of the output sentence 
in the target language. $\Pr(e_1^I)$ is the language model of the target language, whereas $\Pr(f_1^J|e_1^I)$ is the translation model.
Many statistical translation models (Vogel et 
al., 1996; Tillmann et al., 1997; Niessen et al., 
1998; Brown et al., 1993) try to model word-to- 
word correspondences between source and tar- 
get words. The model is often further restricted so that each source word is assigned exactly one target word. These alignment models are similar to the concept of Hidden Markov models
(HMM) in speech recognition. The alignment 
mapping is $j \to i = a_j$ from source position $j$ to target position $i = a_j$. The use of this alignment model raises major problems, as it fails to capture dependencies between groups of words. As experiments have shown, it is difficult to handle different word order and the translation of compound nouns.
In this paper, we will describe two methods 
for statistical machine translation extending the 
baseline alignment model in order to account 
for these problems. In section 2, we briefly review the single-word based approach described in (Tillmann et al., 1997) with some recently implemented extensions allowing for one-to-many
alignments. In section 3 we describe the align- 
ment template approach which explicitly mod- 
els shallow phrases and in doing so tries to over- 
come the above mentioned restrictions of single- 
word alignments. The described method is an 
improvement of (Och and Weber, 1998), result- 
ing in an improved training and a faster search 
organization. The basic idea is to model two 
different alignment levels: a phrase level align- 
ment between phrases and a word level align- 
ment between single words within these phrases. 
Similar aims are pursued by (Alshawi et al., 1998; Wang and Waibel, 1998), but approached differently. In section 4 we compare the two
methods using the Verbmobil task. 
2 Single-Word Based Approach 
2.1 Basic Approach 
In this section, we briefly review a translation
approach based on the so-called monotonicity 
requirement (Tillmann et al., 1997). Our aim is 
to provide a basis for comparing the two differ- 
ent translation approaches presented. 
In Eq. (1), $\Pr(e_1^I)$ is the language model, which is a trigram language model in this case. For the translation model $\Pr(f_1^J|e_1^I)$ we make
the assumption that each source word is aligned 
to exactly one target word (a relaxation of this 
assumption is described in section 2.2). For 
our model, the probability of alignment aj for 
position j depends on the previous alignment 
position aj-1 (Vogel et al., 1996). Using this 
assumption, there are two types of probabil- 
ities: the alignment probabilities denoted by 
$p(a_j|a_{j-1})$ and the lexicon probabilities denoted by $p(f_j|e_{a_j})$. The string translation probability can be re-written:

$$\Pr(f_1^J|e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} \left[ p(a_j|a_{j-1}) \cdot p(f_j|e_{a_j}) \right]$$
For the training of the above model parame- 
ters, we use the maximum likelihood criterion in 
the so-called maximum approximation. When 
aligning the words in parallel texts (for Indo- 
European language pairs like Spanish-English,
French-English, Italian-German,...), we typi- 
cally observe a strong localization effect. In 
many cases, although not always, there is an 
even stronger restriction: over large portions 
of the source string, the alignment is mono- 
tone. In this approach, we first assume that 
the alignments satisfy the monotonicity require- 
ment. Within the translation search, we will in- 
troduce suitably restricted permutations of the 
source string, to satisfy this requirement. For 
the alignment model, the monotonicity property 
allows only transitions from $a_{j-1}$ to $a_j$ with a jump width $\delta := a_j - a_{j-1} \in \{0, 1, 2\}$. These jumps correspond to the following three cases ($\delta = 0, 1, 2$):
• $\delta = 0$ (horizontal transition = alignment repetition): This case corresponds to a target word with two or more aligned source words.

• $\delta = 1$ (forward transition = regular alignment): This is the regular case: a single new target word is generated.

• $\delta = 2$ (skip transition = non-aligned word): This case corresponds to skipping a word, i.e. there is a word in the target string with no aligned word in the source string.
The possible alignments using the monotonic- 
ity assumption are illustrated in Fig. 1. Mono- 
tone alignments are paths through this uni- 
form trellis structure.

Figure 1: Illustration of alignments for the monotone HMM.

Using the concept of
monotone alignments a search procedure can be 
formulated which is equivalent to finding the 
best path through a translation lattice, where 
the following auxiliary quantity is evaluated us- 
ing dynamic programming:

$Q_{e'}(j, e)$ := probability of the best partial hypothesis $(e_1^i, a_1^j)$ with $e_i = e$, $e_{i-1} = e'$ and $a_j = i$.

Here, $e$ and $e'$ are the two final words of the hypothesized target string. The auxiliary quantity is evaluated in
a position-synchronous way, where j is the pro- 
cessed position in the source string. The result 
of this search is a mapping $j \to (a_j, e_{a_j})$, where each source word is mapped to a target position $a_j$ and a word $e_{a_j}$ at this position. For a
trigram language model the following DP recur- 
sion equation is evaluated: 
$$Q_{e'}(j, e) = p(f_j|e) \cdot \max \Big\{\, p(0) \cdot Q_{e'}(j-1, e),\;\; p(1) \cdot \max_{e''} \{ p(e|e', e'') \cdot Q_{e''}(j-1, e') \},\;\; p(2) \cdot \max_{e'', e'''} \{ p(e|e', e'') \cdot p(e'|e'', e''') \cdot Q_{e'''}(j-1, e'') \} \,\Big\}$$
$p(\delta)$ is the alignment probability for the three cases above, $p(\cdot|\cdot,\cdot)$ denotes the trigram language model, and $e, e', e'', e'''$ are the four final words which are considered in the dynamic programming, taking into account the monotonicity restriction and a trigram language model. The
DP equation is evaluated recursively to find the 
best partial path to each grid point $(j, e', e)$. No explicit length model for the length of the generated target string $e_1^I$ given the source string $f_1^J$ is used during the generation process. The
length model is implicitly given by the align- 
ment probabilities. The optimal translation is 
obtained by carrying out the following optimiza- 
tion: 
$$\max_{e, e'} \{ Q_{e'}(J, e) \cdot p(\$|e, e') \},$$
where J is the length of the input sentence and 
$\$$ is a symbol denoting the sentence end. The complexity of the algorithm for full search is $J \cdot E^4$, where $E$ is the size of the target language vocabulary. However, this is drastically reduced by beam-search.
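To make the recursion concrete, the following minimal Python sketch carries out this DP with toy probability tables. The callables p_lex, p_jump and p_lm, the begin-of-sentence token and all names are our illustrative assumptions, not the authors' implementation; beam pruning, backpointers and the sentence-end term $p(\$|e, e')$ are omitted. The nested loops mirror the $J \cdot E^4$ full-search complexity.

```python
import math

NEG_INF = float("-inf")

def lp(x):
    """Safe logarithm for probabilities that may be zero."""
    return math.log(x) if x > 0 else NEG_INF

def monotone_search(source, target_vocab, p_lex, p_jump, p_lm, bos="<s>"):
    """Toy DP over the trellis of Section 2.1.

    Q maps a state (e'', e') -- the two most recent target words -- to
    the best log-probability after processing source positions 1..j.
    p_lex(f, e), p_jump(delta) and p_lm(e, e_prev, e_pprev) are assumed
    toy callables.
    """
    # j = 1: every hypothesis starts by generating one target word.
    Q = {(bos, e): lp(p_lm(e, bos, bos)) + lp(p_lex(source[0], e))
         for e in target_vocab}
    for f in source[1:]:
        Q_new = {}
        def relax(state, score):
            if score > Q_new.get(state, NEG_INF):
                Q_new[state] = score
        for (e2, e1), score in Q.items():
            if score == NEG_INF:
                continue
            # delta = 0: the last target word emits another source word.
            relax((e2, e1), score + lp(p_jump(0)) + lp(p_lex(f, e1)))
            for e in target_vocab:
                # delta = 1: generate one new target word e.
                relax((e1, e), score + lp(p_jump(1))
                      + lp(p_lm(e, e1, e2)) + lp(p_lex(f, e)))
                for e_skip in target_vocab:
                    # delta = 2: insert a target word e_skip with no
                    # aligned source word, then let e emit f.
                    relax((e_skip, e), score + lp(p_jump(2))
                          + lp(p_lm(e_skip, e1, e2))
                          + lp(p_lm(e, e_skip, e1)) + lp(p_lex(f, e)))
        Q = Q_new
    return max(Q.items(), key=lambda kv: kv[1])  # best final state, score
```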
2.2 One-to-many alignment model 
The baseline alignment model does not permit a source word to be aligned with two or more target words. Therefore, lexical correspondences like 'Zahnarzttermin' for dentist's appointment cause problems, because a single source word must be mapped onto two or more target words. To solve this problem for the
alignment in training, we first reverse the trans- 
lation direction, i. e. English is now the source 
language, and German is the target language. 
For this reversed translation direction, we per- 
form the usual training and then check the 
alignment paths obtained in the maximum ap- 
proximation. Whenever a German word is 
aligned with a sequence of the adjacent English 
words, this sequence is added to the English vo- 
cabulary as an additional entry. As a result, 
we have an extended English vocabulary. Using 
this new vocabulary, we then perform the stan- 
dard training for the original translation direc- 
tion. 
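A minimal sketch of this vocabulary extension, assuming the reversed-direction Viterbi alignments are already available; function and variable names are ours, not the authors' code.

```python
def extend_english_vocabulary(aligned_corpus, join="_"):
    """Collect new multi-word English entries (Section 2.2 heuristic).

    aligned_corpus is assumed to yield (e_words, g_words, a), where a[j]
    gives the German position aligned to English position j in the
    Viterbi alignment of the reversed (English -> German) direction.
    """
    new_entries = set()
    for e_words, g_words, a in aligned_corpus:
        # Group the English positions aligned to each German position.
        by_german = {}
        for j, i in enumerate(a):
            by_german.setdefault(i, []).append(j)
        for positions in by_german.values():
            # A German word aligned to a sequence of adjacent English
            # words yields one additional vocabulary entry,
            # e.g. 'Zahnarzttermin' -> "dentist's_appointment".
            if len(positions) > 1 and positions == list(
                    range(positions[0], positions[-1] + 1)):
                new_entries.add(join.join(e_words[j] for j in positions))
    return new_entries
```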
2.3 Extension to Handle 
Non-Monotonicity 
Our approach assumes that the alignment is 
monotone with respect to the word order for 
the lion's share of all word alignments. For 
the translation direction German-English the 
monotonicity constraint is violated mainly with 
respect to the verb group. In German, the verb 
group usually consists of a left and a right ver- 
bal brace, whereas in English the words of the 
verb group usually form a sequence of consec- 
utive words. For our DP search, we use a left-to-right beam-search concept first introduced in speech recognition, where we rely on beam-search as an efficient pruning technique in order to handle potentially huge search spaces.
Our ultimate goal is speech translation aim- 
ing at a tight integration of speech recognition 
and translation (Ney, 1999). The results pre- 
sented were obtained by using a quasi-monotone 
search procedure, which proceeds from left to 
right along the position of the source sentence 
but allows for a small number of source posi- 
tions that are not processed monotonically. The 
word re-orderings of the source sentence posi- 
tions were restricted to the words of the Ger- 
man verb group. Details of this approach will 
be presented elsewhere. 
3 Alignment Template Approach 
A general deficiency of the baseline alignment 
models is that they are only able to model corre- 
spondences between single words. A first coun- 
termeasure was the refined alignment model de- 
scribed in section 2.2. A more systematic ap- 
proach is to consider whole phrases rather than 
single words as the basis for the alignment mod- 
els. In other words, a whole group of adjacent 
words in the source sentence may be aligned 
with a whole group of adjacent words in the tar- 
get language. As a result the context of words 
has a greater influence and the changes in word 
order from source to target language can be 
learned explicitly. 
3.1 The word level alignment: 
alignment templates 
In this section we will describe how we model 
the translation of shallow phrases. 
T1: zwei, drei, vier, fünf, ...
T2: Uhr
T3: vormittags, nachmittags, abends, ...
S1: two, three, four, five, ...
S2: o'clock
S3: in
S4: the
S5: morning, evening, afternoon, ...

Figure 2: Example of an alignment template and bilingual word classes.
The key elements of our translation model are the alignment templates. An alignment template $z$ is a triple $(F, E, A)$ which describes the alignment $A$ between a source class sequence $F$ and a target class sequence $E$.
The alignment A is represented as a matrix 
with binary values. A matrix element with value 1 means that the words at the corresponding positions are aligned, and the value 0 means that the words are not aligned. If a source word is not aligned to a target word, it is aligned to the empty word $e_0$, which is placed at the imaginary position $i = 0$. This alignment representation is a generalization of the baseline alignments described in (Brown et al., 1993) and allows for many-to-many alignments.
The classes used in F and E are automati- 
cally trained bilingual classes using the method 
described in (Och, 1999) and constitute a parti- 
tion of the vocabulary of source and target lan- 
guage. The class functions $\mathcal{F}$ and $\mathcal{E}$ map words to their classes. The use of classes instead of the words themselves has the advantage of better generalization. If there exist classes in source and target language which contain all towns, it is possible that an alignment template learned from one specific town generalizes to all towns. Fig. 2 shows an example of an alignment template.
An alignment template $z = (F, E, A)$ is applicable to a sequence of source words $\tilde{f}$ if the alignment template classes and the classes of the source words are equal: $\mathcal{F}(\tilde{f}) = F$. The application of the alignment template $z$ constrains the target words $\tilde{e}$ to correspond to the target class sequence: $\mathcal{E}(\tilde{e}) = E$.
The application of an alignment template 
does not determine the target words, but only 
constrains them. For the selection of words from classes we use a statistical model for $p(\tilde{f}|z, \tilde{e})$ based on the lexicon probabilities of a statistical lexicon $p(f|e)$. We assume a mixture alignment between the source and target language words constrained by the alignment matrix $A$:
$$p(\tilde{f}|(F, E, A), \tilde{e}) = \delta(\mathcal{E}(\tilde{e}), E) \cdot \delta(\mathcal{F}(\tilde{f}), F) \cdot \prod_{j=1}^{J} p(f_j|A, \tilde{e}) \quad (2)$$

$$p(f_j|A, \tilde{e}) = \sum_{i=0}^{I} p(i|j; A) \cdot p(f_j|e_i) \quad (3)$$

$$p(i|j; A) = \frac{A(i, j)}{\sum_{i'} A(i', j)} \quad (4)$$
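As an illustration, Eqs. (2)-(4) can be evaluated for a single template application as follows. This is a sketch under our own naming: p_lex stands for an assumed lexicon model $p(f|e)$, and the class-membership deltas of Eq. (2) are taken to be satisfied, i.e. the template is applicable.

```python
def template_probability(f_words, e_words, A, p_lex):
    """Evaluate p(f | (F, E, A), e) via Eqs. (2)-(4).

    e_words[0] is the empty word e_0; A[i][j] is the binary alignment
    matrix.  Each column is assumed to contain at least one 1
    (unaligned source words link to e_0 by convention).
    """
    prob = 1.0
    for j in range(len(f_words)):
        column = [A[i][j] for i in range(len(e_words))]
        norm = sum(column)                 # normalizer of Eq. (4)
        # Eq. (3): mixture over the target words aligned to position j.
        prob *= sum(column[i] / norm * p_lex(f_words[j], e_words[i])
                    for i in range(len(e_words)))
    return prob                            # product of Eq. (2)
```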
3.2 The phrase level alignment
In order to describe the phrase level alignment 
in a formal way, we first decompose both the source sentence $f_1^J$ and the target sentence $e_1^I$ into a sequence of phrases ($k = 1, \ldots, K$):

$$f_1^J = \tilde{f}_1^K, \quad \tilde{f}_k = f_{j_{k-1}+1}, \ldots, f_{j_k}$$
$$e_1^I = \tilde{e}_1^K, \quad \tilde{e}_k = e_{i_{k-1}+1}, \ldots, e_{i_k}$$
In order to simplify the notation and the pre- 
sentation, we ignore the fact that there can be a 
large number of possible segmentations and as- 
sume that there is only one segmentation. In the 
previous section, we have described the align- 
ment within the phrases. For the alignment $\tilde{a}_1^K$ between the source phrases $\tilde{f}_1^K$ and the target phrases $\tilde{e}_1^K$, we obtain the following equation:

$$\Pr(f_1^J|e_1^I) = \Pr(\tilde{f}_1^K|\tilde{e}_1^K) = \sum_{\tilde{a}_1^K} \Pr(\tilde{a}_1^K) \cdot \Pr(\tilde{f}_1^K|\tilde{a}_1^K, \tilde{e}_1^K) = \sum_{\tilde{a}_1^K} \prod_{k=1}^{K} p(\tilde{a}_k|\tilde{a}_{k-1}, K) \cdot p(\tilde{f}_k|\tilde{e}_{\tilde{a}_k})$$
For the phrase level alignment we use a first-order alignment model $p(\tilde{a}_k|\tilde{a}_1^{k-1}, K) = p(\tilde{a}_k|\tilde{a}_{k-1}, K)$, which is in addition constrained to be a permutation of the $K$ phrases.
For the translation of one phrase, we intro- 
duce the alignment template as an unknown 
variable:

$$p(\tilde{f}|\tilde{e}) = \sum_{z} p(z|\tilde{e}) \cdot p(\tilde{f}|z, \tilde{e}) \quad (5)$$
The probability $p(z|\tilde{e})$ of applying an alignment template is estimated by relative frequencies (see next section). The probability $p(\tilde{f}|z, \tilde{e})$ is decomposed by Eq. (2).
3.3 Training 
In this section we show how we obtain the pa- 
rameters of our translation model by using a 
parallel training corpus: 
1. We train two HMM alignment models (Vogel et al., 1996) for the two translation directions $f \to e$ and $e \to f$ by applying the EM algorithm. However, we do not apply the maximum approximation in training, thereby obtaining slightly improved alignments.
2. For each translation direction we calcu- 
late the Viterbi-alignment of the transla- 
tion models determined in the previous 
step. Thus we get two alignment vectors $a_1^J$ and $b_1^I$ for each sentence.
We increase the quality of the alignments by combining the two alignment vectors into one alignment matrix using the following method (see the sketch after this list). $A_1 = \{(a_j, j) \mid j = 1 \ldots J\}$ and $A_2 = \{(i, b_i) \mid i = 1 \ldots I\}$ denote the sets of links in the two Viterbi alignments. In a first step, the intersection $A = A_1 \cap A_2$ is determined. The elements within $A$ are justified by both Viterbi alignments and are therefore very reliable. We now extend the alignment $A$ iteratively by adding links $(i, j)$ occurring only in $A_1$ or in $A_2$ if they have a neighbouring link already in $A$, or if neither the word $f_j$ nor the word $e_i$ is aligned in $A$. The link $(i, j)$ has the neighbouring links $(i-1, j)$, $(i, j-1)$, $(i+1, j)$, and $(i, j+1)$. In the Verbmobil
task (Table 1) the precision of the baseline 
Viterbi alignments is 83.3 percent with En- 
glish as source language and 81.8 percent 
with German as source language. Using 
this heuristic we get an alignment matrix 
with a precision of 88.4 percent without loss 
in recall. 
3. We estimate a bilingual word lexicon $p(f|e)$ by the relative frequencies of the alignment determined in the previous step:

$$p(f|e) = \frac{n_A(f, e)}{n(e)} \quad (6)$$

Here $n_A(f, e)$ is the frequency with which the word $f$ is aligned to $e$, and $n(e)$ is the frequency of $e$ in the training corpus.
4. We determine word classes for source and 
target language. A naive approach for do- 
ing this would be the use of monolingually 
optimized word classes in source and tar- 
get language. Unfortunately, we cannot expect a direct correspondence between independently optimized classes. Therefore, monolingually optimized word classes do not seem to be useful for machine translation.
We determine correlated bilingual classes 
by using the method described in (Och, 
1999). The basic idea of this method is to 
apply a maximum-likelihood approach to 
the joint probability of the parallel training 
corpus. The resulting optimization crite- 
rion for the bilingual word classes is similar 
to the one used in monolingual maximum- 
likelihood word clustering. 
5. We count all phrase-pairs of the training 
corpus which are consistent with the align- 
ment matrix determined in step 2. A 
phrase-pair is consistent with the align- 
ment if the words within the source phrase 
are only aligned to words within the tar- 
get phrase. Thus we obtain a count n(z) 
of how often an alignment template oc- 
curred in the aligned training corpus. The 
probability of using an alignment template 
needed by Eq. (5) is estimated by relative 
frequency: 
$$p(z = (F, E, A)|\tilde{e}) = \frac{n(z) \cdot \delta(E, \mathcal{E}(\tilde{e}))}{n(\mathcal{E}(\tilde{e}))} \quad (7)$$
Figure 3: Example of a word alignment and some learned alignment templates.

Fig. 3 shows some of the extracted alignment templates. The extraction algorithm
does not perform a selection of good or bad 
alignment templates - it simply extracts all 
possible alignment templates. 
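The combination heuristic of step 2 can be rendered in a few lines of Python. This is our set-based sketch, with links following the (i, j) = (target, source) convention of the text; names are illustrative, not the original implementation.

```python
def combine_alignments(A1, A2):
    """Combine two Viterbi alignments into one alignment matrix.

    A1 and A2 are sets of links (i, j) from the two translation
    directions; the result starts from their intersection and is
    extended iteratively as described in step 2.
    """
    A = set(A1 & A2)            # reliable links from the intersection
    candidates = (A1 | A2) - A
    neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
    changed = True
    while changed:              # iterative extension
        changed = False
        for (i, j) in sorted(candidates - A):
            has_neighbour = any((i + di, j + dj) in A
                                for di, dj in neighbours)
            e_unaligned = all(i != i2 for (i2, _) in A)
            f_unaligned = all(j != j2 for (_, j2) in A)
            # Add the link if it touches A, or if both words are
            # still unaligned in A.
            if has_neighbour or (e_unaligned and f_unaligned):
                A.add((i, j))
                changed = True
    return A
```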
3.4 Search 
For decoding we use the following search criterion:

$$\arg\max_{e_1^I} \{ p(e_1^I) \cdot p(e_1^I|f_1^J) \} \quad (8)$$

This decision rule is an approximation to Eq. (1), which would use the translation probability $p(f_1^J|e_1^I)$. Using the simplification it is easy to
integrate translation and language model in the 
search process as both models predict target 
words. As experiments have shown this simpli- 
fication does not affect the quality of translation 
results. 
To allow for the influence of long contexts, we use a class-based five-gram language model with backing-off.
The search space denoted by Eq. (8) is very 
large. Therefore we apply two preprocessing 
steps before the translation of a sentence: 
1. We determine the set of all source phrases in $f_1^J$ for which an applicable alignment template exists. Every possible application of an alignment template to a sub-sequence of the source sentence is called an alignment template instantiation.
2. We now perform a segmentation of the input sentence. We search for a sequence of phrases $\tilde{f}_1 \circ \ldots \circ \tilde{f}_K = f_1^J$ with:

$$\arg\max_{\tilde{f}_1 \circ \ldots \circ \tilde{f}_K = f_1^J} \prod_{k=1}^{K} \max_{z} p(z|\tilde{f}_k) \quad (9)$$

This is done efficiently by dynamic programming; a minimal sketch follows this list. Because of the simplified decision rule (Eq. (8)), $p(z|\tilde{f}_k)$ is used in Eq. (9) instead of $p(z|\tilde{e}_k)$.
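A sketch of this segmentation DP over phrase end positions; the helper best_template_prob (returning $\max_z p(z|\tilde{f}_k)$, or 0.0 if no template applies) and the phrase-length bound max_len are our assumptions.

```python
import math

def segment_source(f, best_template_prob, max_len=6):
    """DP segmentation of the source sentence according to Eq. (9)."""
    J = len(f)
    NEG_INF = float("-inf")
    best = [NEG_INF] * (J + 1)   # best[j]: best log-score for f[:j]
    back = [0] * (J + 1)         # back[j]: start of the last phrase
    best[0] = 0.0
    for j in range(1, J + 1):
        for jp in range(max(0, j - max_len), j):
            p = best_template_prob(tuple(f[jp:j]))
            if p > 0.0 and best[jp] + math.log(p) > best[j]:
                best[j] = best[jp] + math.log(p)
                back[j] = jp
    phrases, j = [], J           # trace back the phrase boundaries
    while j > 0:
        phrases.append(f[back[j]:j])
        j = back[j]
    return phrases[::-1]
```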
Afterwards the actual translation process be- 
gins. It has a search organization along the po- 
sitions of the target language string. In search 
we produce partial hypotheses, each of which 
contains the following information: 
1. the last target word produced, 
2. the state of the language model (the classes 
of the last four target words), 
3. a bit-vector representing the already cov- 
ered positions of the source sentence, 
4. a reference to the alignment template in- 
stantiation which produced the last target 
word, 
5. the position of the last target word in the 
alignment template instantiation, 
6. the accumulated costs (the negative loga- 
rithm of the probabilities) of all previous 
decisions, 
7. a reference to the previous partial hypoth- 
esis. 
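In code, such a hypothesis might be represented as follows; this is a sketch with our own field names, not the original implementation. Items 1-5 form the recombination key used in the next paragraph.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class PartialHypothesis:
    last_word: str                 # 1. last target word produced
    lm_state: Tuple[int, ...]      # 2. classes of the last four target words
    coverage: int                  # 3. bit-vector over covered source positions
    template_id: int               # 4. alignment template instantiation
    template_pos: int              # 5. position within that instantiation
    costs: float                   # 6. accumulated negative log-probabilities
    predecessor: Optional["PartialHypothesis"]   # 7. back-pointer

    def recombination_key(self):
        # Hypotheses agreeing on elements 1-5 cannot be distinguished
        # by the language or translation model; only the cheapest
        # needs to survive.
        return (self.last_word, self.lm_state, self.coverage,
                self.template_id, self.template_pos)
```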
A partial hypothesis is extended by append- 
ing one target word. The set of all partial hy- 
potheses can be structured as a graph with a 
source node representing the sentence start, leaf 
nodes representing full translations and inter- 
mediate nodes representing partial hypotheses. 
We recombine partial hypotheses which cannot be distinguished by either the language model or the translation model. When the elements 1-5 of two partial hypotheses are identical, the hypothesis with the higher costs can be dropped for the subsequent search process.
We also use beam-search in order to handle 
the huge search space. We compare in beam- 
search hypotheses which cover different parts of 
the input sentence. This makes the compari- 
son of the costs somewhat problematic. There- 
fore we integrate an (optimistic) estimation of 
the remaining costs to arrive at a full trans- 
lation. This can be done efficiently by deter- 
mining in advance for each word in the source 
language sentence a lower bound for the costs 
of the translation of this word. Together with 
the bit-vector stored in a partial hypothesis it 
is possible to achieve an efficient estimation of 
the remaining costs. 
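A sketch of this rest-cost estimation, assuming the per-word lower bounds have been precomputed; names are ours.

```python
def rest_cost(coverage, word_lower_bounds):
    """Optimistic estimate of the cost still needed for a full translation.

    word_lower_bounds[j] is a precomputed lower bound for the cost of
    translating source word j; coverage is the hypothesis bit-vector,
    with bit j set iff source position j is already covered.
    """
    return sum(bound
               for j, bound in enumerate(word_lower_bounds)
               if not (coverage >> j) & 1)

# Beam pruning then compares hypotheses on costs + rest_cost(...),
# which makes hypotheses covering different input parts comparable.
```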
4 Translation results 
The "Verbmobil Task" (Wahlster, 1993) is a 
speech translation task in the domain of ap- 
pointment scheduling, travel planning, and ho- 
tel reservation. The task is difficult because it 
consists of spontaneous speech and the syntac- 
tic structures of the sentences are less restricted 
and highly variable. 
The translation direction is from German to 
English which poses special problems due to the 
big difference in the word order of the two lan- 
guages. We present results on both the text 
transcription and the speech recognizer output 
using the alignment template approach and the 
single-word based approach. 
The text input was obtained by manu- 
ally transcribing the spontaneously spoken sen- 
tences. There was no constraint on the length of 
the sentences, and some of the sentences in the 
test corpus contain more than 50 words. There- 
fore, for text input, each sentence is split into 
shorter units using the punctuation marks. The 
segments thus obtained were translated sepa- 
rately, and the final translation was obtained 
by concatenation. 
In the case of speech input, the speech recog- 
nizer along with a prosodic module produced 
so-called prosodic markers which are equivalent 
to punctuation marks in written language. The 
experiments for speech input were performed on 
the single-best sentence of the recognizer. The 
recognizer had a word error rate of 31.0%. Con- 
sidering only the real words without the punc- 
tuation marks, the word error rate was smaller, 
namely 20.3%. 
A summary of the corpus used in the experi- 
ments is given in Table 1. Here the term word 
refers to full-form word as there is no morpho- 
logical processing involved. In some of our ex- 
periments we use a domain-specific preprocess- 
ing which consists of a list of 803 (for German) 
and 458 (for English) word-joinings and word- 
splittings for word compounds, numbers, dates 
and proper names. To improve the lexicon prob- 
abilities and to account for unseen words we 
added a manually created German-English dic- 
tionary with 13 388 entries. The classes used 
were constrained so that all proper names were 
included in a single class. Apart from this, the 
classes were automatically trained using the de- 
scribed bilingual clustering method. For each of 
the two languages 400 classes were used. 
For the single-word based approach, we used 
the manual dictionary as well as the preprocess- 
ing steps described above. Neither the transla- 
tion model nor the language model used classes 
in this case. In principle, when re-ordering words of the source string, words of the German verb group could be moved across punctuation marks, although this was penalized by a constant cost.
Table 1: Training and test conditions for the Verbmobil task. The extended vocabulary includes the words of the manual dictionary. The trigram perplexity (PP) is given.

                             German    English
  Train  Sentences               34 465
         Words               363 514    383 509
         Voc.                  6 381      3 766
         Extended Voc.         9 062      8 437
  Test   Sentences                  147
         Words                 1 968      2 173
         PP                        -       31.5
In all experiments, we use the following three 
error criteria: 
• WER (word error rate): 
The WER is computed as the minimum 
number of substitution, insertion and dele- 
tion operations that have to be performed 
to convert the generated string into the tar- 
get string. This performance criterion is 
widely used in speech recognition. 
• PER (position-independent word error 
rate): 
A shortcoming of the WER is the fact that 
it requires a perfect word order.

Table 2: Experiments for text and speech input: word error rate (WER), position-independent word error rate (PER) and subjective sentence error rate (SSER) with/without preprocessing (147 sentences = 1 968 words of the Verbmobil task).

                               Preproc.  WER [%]  PER [%]  SSER [%]
  Single-Word Based Approach
    Text                       No          53.4     38.3     35.7
                               Yes         56.0     41.2     35.3
    Speech                     No          67.8     50.1     54.8
                               Yes         67.8     51.4     52.7
  Alignment Templates
    Text                       No          49.5     35.3     31.5
                               Yes         48.3     35.1     27.2
    Speech                     No          63.5     45.6     52.4
                               Yes         62.8     45.6     50.3

This is
particularly a problem for the Verbmobil 
task, where the word order of the German- 
English sentence pair can be quite different. 
As a result, the word order of the automat- 
ically generated target sentence can be dif- 
ferent from that of the target sentence, but 
nevertheless acceptable so that the WER 
measure alone could be misleading. In or- 
der to overcome this problem, we intro- 
duce as additional measure the position- 
independent word error rate (PER). This 
measure compares the words in the two sen- 
tences without taking the word order into 
account. Words that have no matching 
counterparts are counted as substitution 
errors. Depending on whether the trans- 
lated sentence is longer or shorter than the 
target translation, the remaining words re- 
sult in either insertion or deletion errors in 
addition to substitution errors. The PER 
is guaranteed to be less than or equal to the WER (a small sketch of this computation follows this list).
• SSER (subjective sentence error rate):
For a more detailed analysis, subjective 
judgments by test persons are necessary. 
Each translated sentence was judged by a 
human examiner according to an error scale 
from 0.0 to 1.0. A score of 0.0 means that 
the translation is semantically and syntac- 
tically correct, a score of 0.5 means that a 
sentence is semantically correct but syntac- 
tically wrong and a score of 1.0 means that 
the sentence is semantically wrong. The human examiner was offered the translated sentences of the two approaches at the same time. As a result, we expect better reproducibility of the judgments.
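A small sketch of the PER computation described above, comparing the two sentences as bags of words; normalizing by the reference length is our assumption, as the text does not state the denominator.

```python
from collections import Counter

def per(candidate, reference):
    """Position-independent word error rate over two token lists."""
    matches = sum((Counter(candidate) & Counter(reference)).values())
    # Unmatched words are substitutions plus, depending on the length
    # difference, insertions or deletions; together this collapses to:
    errors = max(len(candidate), len(reference)) - matches
    return errors / len(reference)
```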
The results of the translation experiments 
using the single-word based approach and the 
alignment template approach on text input and 
on speech input are summarized in Table 2. The 
results are shown with and without the use of 
domain-specific preprocessing. The alignment 
template approach produces better translation 
results than the single-word based approach. 
From this we draw the conclusion that it is im- 
portant to model word groups in source and tar- 
get language. Considering the recognition word error rate of 31%, the degradation of about 20% for speech input is to be expected. The average translation time on an Alpha workstation for a single sentence is about one second for the alignment template approach and 30 seconds for the single-word based search procedure.
Within the Verbmobil project other trans- 
lation modules based on rule-based, example- 
based and dialogue-act-based translation are 
used. We are not able to present results with 
these methods using our test corpus. But in 
the current Verbmobil prototype the prelimi- 
nary evaluations show that the statistical meth- 
ods produce comparable or better results than 
the other systems. An advantage of the sys- 
tem is that it is robust and always produces a 
translation result even if the input of the speech 
recognizer is quite incorrect. 
5 Summary 
We have described two approaches to perform 
statistical machine translation which extend the 
baseline alignment models. The single-word based approach allows for the possibility of
one-to-many alignments. The alignment tem- 
plate approach uses two different alignment lev- 
els: a phrase level alignment between phrases 
and a word level alignment between single 
words. As a result the context of words has 
a greater influence and the changes in word 
order from source to target language can be 
learned explicitly. An advantage of both meth- 
ods is that they learn fully automatically by us- 
ing a bilingual training corpus and are capa- 
ble of achieving better translation results on a 
limited-domain task than other example-based 
or rule-based translation systems. 
Acknowledgment 
This work has been partially supported as 
part of the Verbmobil project (contract number 
01 IV 701 T4) by the German Federal Ministry 
of Education, Science, Research and Technol- 
ogy and as part of the EuTrans project by the European Community (ESPRIT project number 30268).

References 

Hiyan Alshawi, Srinivas Bangalore, and Shona 
Douglas. 1998. Automatic acquisition of hi- 
erarchical transduction models for machine 
translation. In COLING-ACL '98: Annual 
Conf. of the Association for Computational 
Linguistics and 17th Int. Conf. on Compu- 
tational Linguistics, volume 1, pages 41-47, 
Montreal, Quebec, Canada, August. 

Peter F. Brown, Stephen A. Della Pietra, Vin- 
cent J. Della Pietra, and Robert L. Mercer. 
1993. The mathematics of statistical machine 
translation: Parameter estimation. Compu- 
tational Linguistics, 19(2):263-311. 

Hermann Ney. 1999. Speech translation: Cou- 
pling of recognition and translation. In Proc. 
Int. Conf. on Acoustics, Speech, and Sig- 
nal Processing, pages 517-520, Phoenix, AR, 
March. 

Sonja Niessen, Stephan Vogel, Hermann Ney, 
and Christoph Tillmann. 1998. A DP-based 
search algorithm for statistical machine trans- 
lation. In COLING-ACL '98: Annual Conf. 
of the Association for Computational Lin- 
guistics and 17th Int. Conf. on Computa- 
tional Linguistics, pages 960-967, Montreal, 
Canada, August. 

Franz Josef Och and Hans Weber. 1998. Im- 
proving statistical natural language transla- 
tion with categories and rules. In Proc. of 
the 35th Annual Conf. of the Association for 
Computational Linguistics and the 17th Int. 
Conf. on Computational Linguistics, pages 
985-989, Montreal, Canada, August. 

Franz Josef Och. 1999. An efficient method to 
determine bilingual word classes. In EACL 
'99: Ninth Conf. of the Europ. Chapter of 
the Association for Computational Linguis- 
tics, Bergen, Norway, June. 

Christoph Tillmann, Stephan Vogel, Hermann 
Ney, and Alex Zubiaga. 1997. A DP-based 
search using monotone alignments in statisti- 
cal translation. In Proc. 35th Annual Conf. of 
the Association for Computational Linguis- 
tics, pages 289-296, Madrid, Spain, July. 

Stephan Vogel, Hermann Ney, and Christoph 
Tillmann. 1996. HMM-based word align- 
ment in statistical translation. In COLING 
'96: The 16th Int. Conf. on Computational 
Linguistics, pages 836-841, Copenhagen, Au- 
gust. 

Wolfgang Wahlster. 1993. Verbmobil: Translation of face-to-face dialogs. In Proc. of the MT
Summit IV, pages 127-135, Kobe, Japan. 

Ye-Yi Wang and Alex Waibel. 1998. Modeling with structures in statistical machine translation. In COLING-ACL '98: Annual Conf. of
the Association for Computational Linguis- 
tics and 17th Int. Conf. on Computational 
Linguistics, volume 2, pages 1357-1363, Mon- 
treal, Quebec, Canada. 
