Improving Statistical Natural Language Translation with 
Categories and Rules 
Franz Josef Och and Hans Weber 
FAU Erlangen - Computer Science Institute, 
IMMD VIII - Artificial Intelligence, 
Am Weichselgarten 9, 91058 Erlangen - Tennenlohe, Germany 
{faoch, weber}@immd8.informatik.uni-erlangen.de
Abstract 
This paper describes an all-level approach to
statistical natural language translation (SNLT).
Without any predefined knowledge, the system
learns a statistical translation lexicon (STL),
word classes (WCs) and translation rules (TRs)
from a parallel corpus, thereby producing a
generalized form of word alignment (WA). The
translation process itself is realized as a beam
search. In our method, example-based techniques
enter an overall statistical approach, which
yields about 50 percent correctly translated
sentences on the very difficult English-German
VERBMOBIL spontaneous speech corpus.
1 Introduction 
In SNLT the transfer itself is realized as a max- 
imization process of the form 
$$\mathrm{Trans}(d) = \arg\max_{e} P(e \mid d) \tag{1}$$
Here d is a given source language (SL) sentence 
which has to be translated into a target lan- 
guage (TL) sentence e. In order to model the 
distributions $P(e \mid d)$, all approaches in SNLT use
a "divide and conquer" strategy of approximating
$P(e \mid d)$ by a combination of simpler models.
The problem is to reduce the number of parameters
sufficiently while still ending up with a model able to
describe the linguistic facts of natural language
translation.
The work presented here uses two approximations
for $P(e \mid d)$. One approximation is used
to estimate the relevant parameters in training,
while a modified formula is used for decoding
translations. In detail, we make the following
modifications with respect to approaches published
in the last decade:
1. A refined distance weight for the STL probabilities is used, which
allows for good modeling of the effects caused by syntactic phrases.
2. In order to account for collocations, a WA technique is used in which
one-to-n and n-to-one WAs are allowed.
3. For the translation, WCs are used which are constructed using
clustering techniques, where the STL forms a part of the optimization
criterion.
4. A set of TRs is learned, mapping sequences of SL WCs to sequences of
TL WCs.
Throughout the paper the four topics above 
are described in more detail. Finally we report 
on experimental results produced on the VERB- 
MOBIL corpus. 
2 Learning of the Translation 
Lexicon 
In order to determine the STL, we use a statistical
model for translation and the EM algorithm
to adjust its model parameters. The simple
model 1 (Brown et al., 1993) for the translation
of an SL sentence $d = d_1 \ldots d_l$ into a TL
sentence $e = e_1 \ldots e_m$ assumes that every TL
word is generated independently as a mixture
of the SL words:
$$P(e \mid d) \sim \prod_{j=1}^{m} \sum_{i=0}^{l} t(e_j \mid d_i) \tag{2}$$
In the equation above, $t(e_j \mid d_i)$ stands for the
probability that $e_j$ is generated by $d_i$.
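The model 1 score of equation (2) can be sketched in a few lines of Python. This is only an illustration: the dictionary layout `t[(e, d)]`, the function name `model1_score` and the use of a `"NULL"` token for position 0 are assumptions, not part of the original system.

```python
def model1_score(src, tgt, t):
    """Unnormalised model 1 score of equation (2).

    src: list of SL words (here position 0 is assumed to be a "NULL" token)
    tgt: list of TL words
    t:   dict mapping (tl_word, sl_word) -> t(e_j | d_i)   (hypothetical layout)
    """
    score = 1.0
    for e in tgt:
        # every TL word is a mixture over all SL words
        score *= sum(t.get((e, d), 0.0) for d in src)
    return score


toy_t = {
    ("on", "am"): 0.5,
    ("on", "NULL"): 0.1,
    ("Tuesday", "Dienstag"): 0.9,
    ("Tuesday", "am"): 0.05,
}
toy_score = model1_score(["NULL", "am", "Dienstag"], ["on", "Tuesday"], toy_t)
```

In this toy lexicon the score is the product of the two mixture sums, (0.5 + 0.1) and (0.9 + 0.05).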
The assumption that each SL word influences 
every TL word with the same strength appears 
to be too simple. In the refined model 2 (Brown 
et al., 1993) alignment probabilities $a(i \mid j, l, m)$
are included to model the effect that the po- 
sition of a word influences the position of its 
translation. 
The phrasal organization of natural languages
is well known and has been described by
(Jackendoff, 1977) among many others. The
traditional alignment probabilities depend on
absolute positions and do not take this into
account, as has already been noted by (Vogel et
al., 1996). Therefore we developed a kind of
relative weighting probability. The following
model, which we will call model 2', makes
the weight between the words $d_i$ and $e_j$
dependent on the relative distances between $d_i$ and the
words $d_k$ which generated the previous word
$e_{j-1}$:
$$s(i \mid j, e_{j-1}, d) \sim \sum_{k=0}^{l} d(i-k \mid l) \cdot t(e_{j-1} \mid d_k) \tag{3}$$
Here $d(i-k \mid l)$ is the probability that word $d_i$
influences a word $e_j$ if the previous word $e_{j-1}$ is
influenced by $d_k$. As an effect of such a weight,
a (phrase-)cluster of words being moved over a
long distance receives additional 'cost' only at
the ends of the cluster. So we have the final
translation probability for model 2':
$$P(e \mid d) \sim \prod_{j=1}^{m} \sum_{i=0}^{l} t(e_j \mid d_i) \cdot s(i \mid j, e_{j-1}, d) \tag{4}$$
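To make the effect of the relative distance weight concrete, the following Python sketch evaluates equations (3) and (4) on a toy example. Several things here are assumptions made for illustration only: the distance distribution is a plain dictionary `dist` indexed by the offset i - k (its dependence on the sentence length l is ignored), the first TL word, which has no predecessor, receives weight 1, and all function names are hypothetical.

```python
def distance_weight(i, prev_word, src, t, dist):
    """s(i | j, e_{j-1}, d) of equation (3), up to normalisation."""
    return sum(dist.get(i - k, 0.0) * t.get((prev_word, d_k), 0.0)
               for k, d_k in enumerate(src))


def model2prime_score(src, tgt, t, dist):
    """Unnormalised model 2' score of equation (4)."""
    score = 1.0
    for j, e in enumerate(tgt):
        total = 0.0
        for i, d_i in enumerate(src):
            # Assumption: no previous word for j == 0, so use weight 1 there.
            s = 1.0 if j == 0 else distance_weight(i, tgt[j - 1], src, t, dist)
            total += t.get((e, d_i), 0.0) * s
        score *= total
    return score


toy_t = {("x", "a"): 1.0, ("y", "b"): 1.0}
toy_dist = {1: 0.7, 0: 0.2, -1: 0.1}  # moving one word forward is most likely
monotone = model2prime_score(["a", "b"], ["x", "y"], toy_t, toy_dist)
swapped = model2prime_score(["a", "b"], ["y", "x"], toy_t, toy_dist)
```

With this toy distance table the monotone order scores higher than the swapped order, because the swap needs the unlikely offset -1.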
The parameters involved can be determined us- 
ing the EM algorithm (Baum, 1972). The ap- 
plication of this algorithm to the basic prob- 
lem using a parallel bilingual corpus aligned on 
the sentence level is described in (Brown et al., 
1993). 
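As an illustration of the parameter estimation, here is a minimal EM sketch for model 1 only, in Python. It follows the standard textbook formulation; the function name `train_model1`, the data layout and the uniform initialisation are assumptions, and the extensions needed for model 2' are not shown.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """EM for model 1. corpus: list of (sl_words, tl_words) sentence pairs.
    Returns a dict-like t with t[(e, d)] = t(e | d)."""
    tl_vocab = {e for _, tgt in corpus for e in tgt}
    uniform = 1.0 / len(tl_vocab)
    t = defaultdict(lambda: uniform)  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in corpus:
            for e in tgt:
                # E-step: distribute one count for e over all SL words
                norm = sum(t[(e, d)] for d in src)
                for d in src:
                    frac = t[(e, d)] / norm
                    count[(e, d)] += frac
                    total[d] += frac
        # M-step: relative frequencies of the expected counts
        t = defaultdict(float, {(e, d): c / total[d] for (e, d), c in count.items()})
    return t


toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
toy_lexicon = train_model1(toy_corpus, iterations=10)
```

On this toy corpus the lexicon entries for co-occurring pairs such as (the, das) grow with every iteration at the expense of the competing entries.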
3 Determining a Word Alignment 
The kind of WA we use is more general than
the often-used WA through a vector, where every
TL word is generated by exactly one SL
word. We use a matrix Z for every sentence
pair, whose fields describe whether or not two
words are aligned. In this approach, multiple
words can be aligned to one TL word, which is
motivated by collocation phenomena such as
German compound nouns. With our method, alignments
may look like the one in figure 1.
The matrix Z contains $l + 1$ rows and
$m$ columns with binary values. The value $z_{ij} = 1$
($z_{ij} = 0$) means that the word $d_i$ influences (does not influence)
the word $e_j$. In figure 1 every link stands for
$z_{ij} = 1$.
[Figure 1: Alignment example.]
The models 1, 2 and 2' and some similar models
can be described in the form
$$P(e \mid d) \sim \prod_{j=1}^{m} \sum_{i=0}^{l} x_{ij} \tag{5}$$
where the value $x_{ij}$ is the strength of the influence
of word $d_i$ on word $e_j$. We use a threshold
$\Theta < 1$ in such a way that while the sum
$\sum_{k=0}^{s} x_{i_k j}$ of the first values in sorted order is smaller than
$\Theta \cdot \sum_{k=0}^{l} x_{kj}$ we set $z_{i_s j} = 0$. The other values
are set to 1. The permutation $i_0, \ldots, i_l$ sorts the
$x_{ij}$ so that $x_{i_0 j} < \ldots < x_{i_l j}$.
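The thresholding step for one TL position can be sketched as follows. This is a Python illustration under assumptions: the function name `align_column` and the list representation of one column of influence values are hypothetical.

```python
def align_column(x_col, theta):
    """Binary alignment for one TL word.

    x_col[i] is the influence x_ij of SL word i on the fixed TL word j.
    Entries are visited in ascending order; while the running sum stays
    below theta * (column total) the entry is switched off (z = 0).
    """
    total = sum(x_col)
    order = sorted(range(len(x_col)), key=lambda i: x_col[i])
    z = [1] * len(x_col)
    running = 0.0
    for i in order:
        running += x_col[i]
        if running < theta * total:
            z[i] = 0  # weak influence, dropped
        else:
            break     # all remaining, stronger entries keep z = 1
    return z


loose = align_column([0.05, 0.6, 0.05, 0.3], theta=0.2)
strict = align_column([0.05, 0.6, 0.05, 0.3], theta=0.5)
```

A small threshold keeps several SL words linked to the TL word (an n-to-one alignment), while a larger threshold leaves only the strongest link.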
Interestingly, such a WA technique does
not in general lead to the same results when
applied from TL to SL and vice versa. If we
use $P(e \mid d)$ or $P(d \mid e)$ we obtain different WAs
$z^{ed}$ and $z^{de}$. Intuitively the relation between the
words of the sentences should be symmetric and
there should be the same WA. It is possible to
enforce the symmetry with $z_{ij} = z^{ed}_{ij} \cdot z^{de}_{ij}$, so
that a link between two words is made only if
there is a link in both WAs.
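The symmetrization is an element-wise product of the two binary matrices. A minimal Python sketch, assuming both matrices are stored as nested lists indexed `[i][j]` with the same orientation (the name `symmetrize` is hypothetical):

```python
def symmetrize(z_ed, z_de):
    """Keep a link only if it is present in both directional alignments."""
    return [[a * b for a, b in zip(row_ed, row_de)]
            for row_ed, row_de in zip(z_ed, z_de)]


z_sym = symmetrize([[1, 1], [0, 1]],
                   [[1, 0], [0, 1]])
```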
It is possible to include the WA in the EM
algorithm for the estimation of the model
probabilities. This can be done by replacing $t(e_j \mid d_i)$
by $t(e_j \mid d_i) \cdot z_{ij}$. The resulting STL becomes
much cleaner in the sense that it contains
far fewer wrong entries (see section 7).
4 Learning of Translation Rules 
The incorporation of TRs adds an "example- 
based" touch to the statistical approach. In a 
very naive approach a TR could be represented 
by a translation example. The obvious advantage
is the good quality that can be expected of the
translated sentences. The disadvantage is
that almost no sentence can be translated, because
every corpus would contain too few examples;
the generalization capability of the naive
approach is very limited.
We desired a general kind of TR which does
not use explicit linguistic properties of the
languages involved. In addition the rules should generalize
from very sparse data. Therefore it seemed
natural to use WCs and shorter sequences to 
end up with a set of rather general rules. In or- 
der to achieve a good learning performance, all 
the WCs of a language are pairwise disjoint (see 
section 5). The function C(.) gives the class of 
a word or the sequence of WCs of a sequence of 
words. 
Our TRs are triples (D, E, Z) where D is a
sequence of SL WCs, E is a sequence of TL WCs
and Z is a WA matrix between D and E. To
use one rule in the translation process, we first
rewrite the probability $P(e \mid d)$:
$$P(e \mid d) = \sum_{E,Z} P(E, Z \mid d) \cdot P(e \mid E, Z, d) \tag{6}$$
In order to simplify the maximization (equation 
1) we use only the TR which gives the maximum 
probability. 
During the learning of those TRs we count all
extractable rules occurring in the aligned corpus
and define the probability $p(E, Z \mid C(d)) \approx
P(E, Z \mid d)$ in terms of the relative frequency.
We approximate $P(e \mid E, Z, d)$ by simpler
probabilities, so that we finally need a language
model $p(e_j \mid e_1^{j-1})$, a translation model $p(e_j \mid d, Z)$
and a probability $p(e_j \mid E_j)$. For $p(e_j \mid e_1^{j-1})$ we
use a class-based polygram language model
(Schukat-Talamazzini, 1994). For the translation
probability $p(e_j \mid d, Z)$ we use model 1 and
include the information of the WA:
$$p(e_j \mid d, Z) := \sum_{i=0}^{l} t(e_j \mid d_i) \cdot z_{ij} \tag{7}$$
Figure 2 shows how the application of those 
rules works in principle. We arrive at a list of 
word hypotheses with probabilities for each po- 
sition. Neglecting the language model, the best 
decision would be to independently choose the 
most probable word for every position. 
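The list of word hypotheses per position can be illustrated with a short Python sketch of equation (7). The function name, the dictionary layout `t[(e, d)]` and the explicit candidate vocabulary are assumptions; the language model and the probability p(e_j | E_j) are left out.

```python
def word_hypotheses(src, z, t, vocab):
    """For every TL position j, rank candidate words by
    p(e | d, Z) = sum_i t(e | d_i) * z[i][j]   (equation 7)."""
    n_tgt = len(z[0])
    hyps = []
    for j in range(n_tgt):
        scored = [
            (sum(t.get((e, d_i), 0.0) * z[i][j] for i, d_i in enumerate(src)), e)
            for e in vocab
        ]
        scored.sort(reverse=True)  # most probable candidate first
        hyps.append(scored)
    return hyps


toy_t = {
    ("on", "am"): 0.6, ("on", "Dienstag"): 0.05,
    ("Tuesday", "Dienstag"): 0.9, ("Tuesday", "am"): 0.01,
}
toy_z = [[1, 0],   # "am" is linked to TL position 0
         [0, 1]]   # "Dienstag" is linked to TL position 1
toy_hyps = word_hypotheses(["am", "Dienstag"], toy_z, toy_t, ["on", "Tuesday"])
best_words = [h[0][1] for h in toy_hyps]
```

Picking the top entry at each position corresponds to the decision made when the language model is neglected.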
In general the translation of a sentence involves
more than one rule, and usually
many rules are applicable. An applicable rule is one
whose sequence of SL WCs matches a sequence
of WCs in the sentence. So in the general
case we have to decide on a set of rules
to apply. This set of rules has to cover the
sentence, which means that every word is used in
exactly one rule. The next step is to decide how to arrange
the generated units to get the translated
sentence. Finally we have to decide for every
position which word to use. We want all those
decisions to be optimal in the sense that the
following product is maximized:
$$p(e^{(j_1)} \circ \ldots \circ e^{(j_L)}) \cdot \prod_{k=1}^{L} P(Z^{(k)}, E^{(k)} \mid C(d^{(k)})) \cdot p(e^{(j_k)} \mid Z^{(k)}, E^{(k)}, d^{(k)}) \tag{8}$$
Here L is the number of SL units, $d^{(k)}$ is the k-th
SL unit, $e^{(k)}$ is the k-th TL unit and $j_1, \ldots, j_L$
is a permutation of the numbers 1, ..., L.
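The covering condition can be illustrated by enumerating all admissible rule sets for a class sequence. The Python sketch below assumes that a rule matches a contiguous span of the sentence and represents the SL side of each rule as a tuple of class labels; both the representation and the name `covers` are hypothetical.

```python
def covers(classes, rule_sources):
    """Yield every split of `classes` into contiguous segments such that each
    segment is the SL side of some rule, so each word is used exactly once."""
    if not classes:
        yield []
        return
    for n in range(1, len(classes) + 1):
        head = tuple(classes[:n])
        if head in rule_sources:
            for rest in covers(classes[n:], rule_sources):
                yield [head] + rest


toy_rules = {("A",), ("B", "C"), ("A", "B"), ("C",)}
toy_covers = list(covers(["A", "B", "C"], toy_rules))
```

Here two covers exist, one using the rules for (A) and (B, C) and one using (A, B) and (C); each cover would then be scored with the product in equation (8).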
5 Learning of Category Systems 
During the last decade some publications have 
discussed the problem of learning WCs using 
clustering techniques based on maximum like- 
lihood criteria applied to single language cor- 
pora. The question which we pose in addition 
is: Which WCs are suitable for translation? It 
seems to make sense to require that the
WCs used in the two languages are correlated, so
that the class of an SL
word gives much information about the class of
the generated TL word. Accordingly, it has been
argued in (Fung and Wu, 1995) that independently
generated WCs are not well suited for use
in translation.
For the automatic generation of class systems
there exists a well-known procedure (see (Kneser and
Ney, 1993), (Och, 1995)) which minimizes the
perplexity of the language model for a training
corpus by moving one word from one class to another
in an iterative procedure. The function
$ML(C \mid N_{w \to w'})$ which has to be optimized depends
only on the count function $N_{w \to w'}$, which
counts how often the word w' comes
after the word w.
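A minimal Python sketch of such an exchange procedure is given below. The criterion is one common form of the class-bigram log-likelihood (up to a term that does not depend on the classes); whether it matches the exact function used in the cited work is an assumption, as are the function names and the naive recomputation of the criterion for every tentative move.

```python
import math
from collections import defaultdict


def ml_criterion(cls, counts):
    """Class-bigram log-likelihood, without the class-independent term.
    cls: word -> class id; counts: (w, w2) -> N(w -> w2)."""
    pair, left, right = defaultdict(float), defaultdict(float), defaultdict(float)
    for (w, w2), n in counts.items():
        pair[(cls[w], cls[w2])] += n
        left[cls[w]] += n
        right[cls[w2]] += n

    def xlogx(v):
        return v * math.log(v) if v > 0 else 0.0

    return (sum(xlogx(n) for n in pair.values())
            - sum(xlogx(n) for n in left.values())
            - sum(xlogx(n) for n in right.values()))


def exchange(cls, counts, num_classes, sweeps=5):
    """Greedy exchange: move each word to the class that improves the criterion most."""
    cls = dict(cls)
    for _ in range(sweeps):
        moved = False
        for w in sorted(cls):
            old = cls[w]
            best_c, best_v = old, ml_criterion(cls, counts)
            for c in range(num_classes):
                cls[w] = c
                v = ml_criterion(cls, counts)
                if v > best_v + 1e-12:
                    best_c, best_v = c, v
            cls[w] = best_c
            moved = moved or best_c != old
        if not moved:
            break
    return cls


toy_counts = {(a, b): 5 for a in ("a1", "a2") for b in ("b1", "b2")}
toy_counts.update({(b, a): 5 for a in ("a1", "a2") for b in ("b1", "b2")})
start = {"a1": 0, "b1": 0, "a2": 1, "b2": 1}
final = exchange(start, toy_counts, num_classes=2)
```

In the toy data the words a1, a2 are always followed by b1, b2 and vice versa, so the exchange procedure separates the two groups.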
[Figure 2: Application of a Rule. Labels: source text, translation rule, word hypotheses, translated text.]
Using two sets of WCs for the TL and SL
which are determined independently (method INDEP) does
not guarantee that those WCs are strongly
correlated. The resulting WCs only have the
property that the class of a
word w carries much information about the class
of the following word w'. For the
WCs used for translation we want the
WC of a word to carry much information
about the WC of its translation. To use
the standard method for optimizing WCs we
only need to define a count function $N_{d \to e}$, which
we do by $N_{d \to e}(d, e) := t(e \mid d) \cdot n(e)$. In the
same way a count function $N_{e \to d}$ can be determined,
and we get the new optimization criterion
$ML(C_d \cup C_e \mid N_{d \to e} + N_{e \to d})$. The resulting classes
are strongly correlated, but rarely contain words
with similar syntactic/semantic properties. To
arrive at WCs having both properties (method COMB), we
determine TL WCs with the first method and
afterwards we determine SL WCs with the second
method.
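The translation-based count function can be written down directly. In the Python sketch below, n(e) is assumed to be the corpus frequency of the TL word e (the text does not define it further), the lexicon is stored as `t[(e, d)]`, and the function names are hypothetical. Words of the two languages are assumed to be distinguishable symbols, for example by a language prefix.

```python
from collections import defaultdict


def translation_counts(t, n_tgt):
    """N_{d->e}(d, e) := t(e | d) * n(e), keyed by (d, e)."""
    return {(d, e): p * n_tgt.get(e, 0) for (e, d), p in t.items()}


def combined_counts(n_de, n_ed):
    """Sum of both directional count functions over the joint vocabulary."""
    total = defaultdict(float)
    for table in (n_de, n_ed):
        for pair, value in table.items():
            total[pair] += value
    return dict(total)


toy_n_de = translation_counts({("en:Tuesday", "de:Dienstag"): 0.5}, {"en:Tuesday": 4})
toy_joint = combined_counts(toy_n_de, {("en:Tuesday", "de:Dienstag"): 3.0})
```

The joint table can then be passed to the same class optimization that is normally run on successor counts.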
So we can use the well-known iterative
method to end up with WCs in different languages
which are correlated. We expect these
WCs to be more suitable for building
the TRs from section 4 and finally to result in
a better overall translation performance.
6 Translation as a Search Problem 
The problem of finding the translation of a sen- 
tence can be viewed as a search problem for a 
path with minimal cost in a tree. If we apply 
the negative logarithm to the product of proba- 
bilities in equation 8 we arrive at a sum of costs 
which has to be minimized. The costs stem from 
the language model, the rule probabilities and 
the translation probabilities. In the search tree 
every node represents a partial translation for 
the first words or a full translation. The leaves 
of the tree are the nodes where the applied rules 
define a complete cover of the SL sentence. To 
reduce the search space we use additional costs 
for changing the order of the fragments. 
We use a beam search strategy (Greer et al., 
1982) to find a good path in this tree. To make 
the search feasible we had to implement some
problem-specific heuristics.
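A generic beam search over such a tree can be sketched in Python as follows. This is not the decoder described here: the state representation, the callbacks `expand` and `is_complete`, and the toy word lattice are assumptions, and neither the reordering costs nor the problem-specific heuristics are modeled. Costs are negative log probabilities, so a product of probabilities becomes a sum of costs.

```python
import heapq
import math


def beam_search(start, expand, is_complete, beam_width=5):
    """expand(state) yields (step_cost, next_state) pairs; returns the
    cheapest complete (cost, state) found, or None."""
    beam = [(0.0, start)]
    finished = []
    while beam:
        candidates = []
        for cost, state in beam:
            if is_complete(state):
                finished.append((cost, state))
                continue
            for step_cost, nxt in expand(state):
                candidates.append((cost + step_cost, nxt))
        # keep only the cheapest partial translations
        beam = heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])
    return min(finished, key=lambda c: c[0]) if finished else None


# Toy lattice: one list of (word, probability) hypotheses per TL position.
toy_lattice = [[("on", 0.6), ("at", 0.4)], [("Tuesday", 0.9), ("Tuesdays", 0.1)]]


def toy_expand(state):
    pos, words = state
    for word, prob in toy_lattice[pos]:
        yield -math.log(prob), (pos + 1, words + (word,))


def toy_complete(state):
    return state[0] == len(toy_lattice)


best_cost, best_state = beam_search((0, ()), toy_expand, toy_complete, beam_width=2)
```

With a narrow beam the search may miss the optimal path in general; in this toy lattice the best path survives even with a beam width of 1.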
7 Results 
The experiments in this section have all been 
carried out on the bilingual German-English 
VERBMOBIL corpus. This corpus consists of 
spontaneous utterances from negotiation di- 
alogs which had originally been produced in 
German. For training we used 11 500 randomly 
chosen sentence pairs. 
The first experiment illustrates
our improved technique for
generating an STL using the WA in the EM
algorithm. We generated an STL using 10 EM
iterations for model 1 and 10 iterations for
model 2'. The whole process took about 4 hours
for our corpus. Below are given some STL entries
for German words. The probabilities $t(e \mid d)$
are written in parentheses.
• Tuesday → Dienstag (0.83), den (0.05),
COMMA (0.042), am (0.038), dienstags
(0.018), der (0.009), also (0.0069), passen
(0.0019), diesem (0.0013), steht (0.0012)
• Frankfurt → Frankfurt (0.67), nach (0.12),
in (0.081), mit (0.068), um (0.031),
habe (0.02), besuchen (0.0078), wiederum
(0.0036)
The top positions are always plausible translations,
but many improper translations
are produced as well. When we include the WA in the
EM algorithm as described in section 3, we
obtain fewer lexicon entries of much better
quality:
• Tuesday → Dienstag (0.97), dienstags
(0.029)
• Frankfurt → Frankfurt (1)
The following two corresponding WCs (out of 
600) show a typical result of the method COMB 
to determine correlated WCs: 
• Mittwoch, Donnerstag, Freitag,
Sonnabend, Frühlingsanfang, Karsamstag,
Volkstrauertag, Weihnachtsferien, Sommerschule,
Thomas, einschließen
• Wednesday, Thursday, Friday, Thursdays, 
Fridays, Thomas, Veterans', mourning, na- 
tional, spending, spring, summer-school 
To evaluate the complete system we translated
200 randomly chosen sentences drawn from an
independent test corpus and checked manually
how many of them constituted acceptable translations.
Since we used a spontaneous speech
corpus, many sentences were grammatically incorrect.
A translation is classified 'correct' if
it is an error-free (spontaneous
speech) utterance and 'understandable'
if the intention of the utterance is translated.
The 200 sentences had a mean sentence
length of 10 words. The STL used was generated
using model 2' (see section 2).
          correct   understandable
INDEP     46.5 %    64 %
COMB      52 %      71 %
Table 1: Quality of Translation.
Some example translations: 
• was hältst du von zweiter Februar nachmittags,
nach fünfzehn Uhr → what do you
think about the second of February in the
afternoon, after three o'clock
• I wanted to fix a time with you for a five-day
business trip to Stuttgart → ich wollte
mit Ihnen einen Termin ausmachen für eine
fünftägige Geschäftsreise nach Stuttgart
8 Conclusions 
We have presented several improvements
to SNLT. The most important changes are the 
translation model 2', the representation of WA 
using a matrix, a method to determine corre- 
lated WCs and the use of TRs to constrain 
search. In the future, the rule mechanism 
should be extended. So far the rules learned 
are only loop-free finite state transducers. Still 
many translation errors stem from the inability 
to model long distance dependencies. We intend 
to move to finite state cascades or context free 
grammars in future work. With respect to the 
category sets we feel that an additional morpho- 
logical model could further improve the transla- 
tion quality. As it stands the system still makes 
many errors concerning the number of nominals 
and verbs. This is especially important when 
the language pairs differ with respect to the pro- 
ductivity of their inflectional systems. 
9 Acknowledgements 
We explicitly thank Stefan Vogel of RWTH
Aachen for the material he provided
and Günther Görz for his general support. The
work is part of the German Joint Project VERB- 
MOBIL. This work was funded by the German 
Federal Ministry for Research and Technology 
(BMBF) in the framework of the Verbmobil 
Project under Grant BMBF 01 IV 701 K 5. The 
responsibility for the contents of this study lies 
with the authors. 

References 
L.E. Baum. 1972. An Inequality and Asso- 
ciated Maximization Technique in Statisti- 
cal Estimation for Probabilistic Functions of 
Markov Processes. Inequalities, 3:1-8. 
P. F. Brown, S. A. Della Pietra, V. J. 
Della Pietra, and R. L. Mercer. 1993. The 
mathematics of statistical machine transla- 
tion: Parameter estimation. Computational 
Linguistics, 19(2):263-311. 
P. Fung and D. Wu. 1995. Coerced markov 
models for cross-lingual lexical-tag relations. 
In The Sixth Int. Conf. on Theor. and Methodological
Issues in Machine Translation, pages
240-255, Leuven, Belgium, July. 
K. Greer, B. Lowerre, and L. Wilcox. 1982. 
Acoustic Pattern Matching and Beam Search- 
ing. In Proc. Int. Conf. on Acoustics, Speech, 
and Signal Processing, pages 1251-1254, 
Paris. 
R. Jackendoff. 1977. X-bar-syntax: A study
of phrase structure. In Linguistic Inquiry 
Monograph 2. 
R. Kneser and H. Ney. 1993. Improved Clus- 
tering Techniques for Class-Based Statistical 
Language Modelling. In Eurospeech, pages 
973-976. 
F. J. Och. 1995. Maximum-Likelihood-Schätzung
von Wortkategorien mit Verfahren
der kombinatorischen Optimierung. Studienarbeit,
FAU Erlangen-Nürnberg.
E.G. Schukat-Talamazzini. 1994. Automatische 
Spracherkennung. Vieweg, Wiesbaden. 
S. Vogel, H. Ney, and C. Tillmann. 1996. 
HMM-Based Word Alignment in Statistical 
Translation. In Proc. Int. Conf. on Computational
Linguistics, pages 836-841, Copenhagen,
August.
