Proceedings of EACL '99 
An Efficient Method for Determining Bilingual Word Classes 
Franz Josef Och 
Lehrstuhl ffir Informatik VI 
RWTH Aachen - University of Technology 
Ahornstrai3e 55 
52056 Aachen 
GERMANY 
och@informatik.rwth-aachen.de 
Abstract 
In statistical natural language process- 
ing we always face the problem of sparse 
data. One way to reduce this problem is 
to group words into equivalence classes 
which is a standard method in statistical 
language modeling. In this paper we de- 
scribe a method to determine bilingual 
word classes suitable for statistical ma- 
chine translation. We develop an opti- 
mization criterion based on a maximum- 
likelihood approach and describe a clus- 
tering algorithm. We will show that the 
usage of the bilingual word classes we get 
can improve statistical machine transla- 
tion. 
1 Introduction 
Word classes are often used in language modelling 
to solve the problem of sparse data. Various clus- 
tering techniques have been proposed (Brown et 
al., 1992; Jardino and Adda, 1993; Martin et al., 
1998) which perform automatic word clustering 
optimizing a maximum-likelihood criterion with 
iterative clustering algorithms. 
In the field of statistical machine translation 
we also face the problem of sparse data. Our 
aim is to use word classes in statistical machine 
translation to allow for more robust statistical 
translation models. A naive approach for doing 
this would be the use of mono-lingually optimized 
word classes in source and target language. Un- 
fortunately we can not expect these independently 
optimized classes to be correspondent. There- 
fore mono-lingually optimized word classes do not 
seem to be useful for machine translation (see also 
(Fhng and Wu, 1995)). We define bilingual word 
clustering as the process of forming correspond- 
ing word classes suitable for machine translation 
purposes for a pair of languages using a parallel 
training corpus. 
The described method to determine bilingual 
word classes is an extension and improvement 
of the method mentioned in (Och and Weber, 
1998). Our approach is simpler and computation- 
ally more efficient than (Wang et al., 1996). 
2 Monolingual Word Clustering 
The task of a statistical language model is to es- 
timate the probability Pr(w N) of a sequence of 
words w N = wl...wg. A simple approximation 
of Pr(w N) is to model it as a product of bigram 
probabilities: Pr(w~) N = I-\[i=1P(WilWi-1) • If we 
want to estimate the bigram probabilities p(w\[w') 
using a realistic natural language corpus we are 
faced with the problem that most of the bigrams 
are rarely seen. One possibility to solve this prob- 
lem is to partition the set of all words into equiv- 
alence classes. The function C maps words w to 
their classes C(w). Rewriting the corpus probabil- 
ity using classes we arrive at the following proba- 
bility model p(wNlC): 
N 
P(wNIC) := 
i=1 (1) 
In this model we have two types of probabili- 
ties: the transition probability p(CIC ~) for class 
C given its predecessor class C' and the member- 
ship probability p(wlC ) for word w given class C. 
To determine the optimal classes C for a given 
number of classes M we perform a maximum- 
likelihood approach: 
C = arg mpx p(w lc) (2) 
We estimate the probabilities of Eq. (1) by 
relative frequencies: p(CIC' ) := n(CIC')/n(C'), 
p(wlC ) = n(w)/n(C). The function n(-) provides 
the frequency of a uni- or bigram in the training 
corpus. If we insert this into Eq. (2) and apply 
the negative logarithm and change the summa- 
tion order we arrive at the following optimization 
71 
Proceedings of EACL '99 
criterion LP1 (Kneser and Ney, 1991): 
LPx(C,n) = - ~ h(n(C\]C')) 
C,C' 
+2 Zh(n(C)) (3) 
C 
= argm~n LPI(C,n). (4) 
The function h(n) is a shortcut for n. log(n). 
It is necessary to fix the number of classes in 
C in advance as the optimum is reached if every 
word is a class of its own. Because of this it is 
necessary to perform an additional optimization 
process which determines the number of classes. 
The use of leaving-one-out in a modified optimiza- 
tion criterion as in (Kneser and Ney, 1993) could 
in principle solve this problem. 
An efficient optimization algorithm for LP1 is 
described in section 4. 
3 Bilingual Word Clustering 
In bilingual word clustering we are interested in 
classes ~" and C which form partitions of the vo- 
cabulary of two languages. To perform bilingual 
word clustering we use a maximum-likelihood ap- 
proach as in the monolingnal case. We maximize 
the joint probability of a bilingual training corpus 
(el, f J): 
= argma  (5) $,fi 
= argmax p(e/\[C) .p(fJle~;C,3)(6) $,~" 
To perform the maximization of Eq. (6) we have to 
model the monolingual a priori probability p(e I IE) 
and the translation probability p(fJte~; E, .T). For 
the first we use the class-based bigram probability 
from Eq. (1). 
To model p(fJle~;8,.T) we assume the exis- 
tence of an alignment a J. We assume that ev- 
ery word fj is produced by the word e~j at posi- 
tion aj in the training corpus with the probability 
P(f~le,~i): 
J 
p(f lc ') = 1\] p(L Icon) 
j=l 
(7) 
The word alignment a J is trained automatically 
using statistical translation models as described in 
(Brown et al., 1993; Vogel et al., 1996). The idea 
is to introduce the unknown alignment al J as hid- 
den variable into a statistical model of the trans- 
lation probability p(fJle~). By applying the EM- 
algorithm we obtain the model parameters. The 
alignment a J that we use is the Viterbi-Alignment 
of an HMM alignment model similar to (Vogel et 
al., 1996). 
By rewriting the translation probability using 
word classes, we obtain (corresponding to Eq. (1)): 
J 
p(f le '; E, = 1\] 
j=l (s) 
The variables F and E denote special classes in 
9 v and ~'. We use relative frequencies to estimate 
p(FIE) and p(flF): 
p(F\[E) = nt(FIE)/ (~F hi(FIE)) 
The function nt(FIE) counts how often the words 
in class F are aligned to words in class E. If we 
insert these relative frequencies into Eq. (8) and 
apply the same transformations as in the monolin- 
gual case we obtain a similar optimization crite- 
rion for the translation probability part of Eq. (6). 
Thus the full optimization criterion for bilingual 
word classes is: 
- ~ h(n(E\[E')) - ~ h(nt(FIE)) 
E,E' E,F 
+2Eh(n(E)) 
E 
+ Z h(~-~' nt(FIE))+ ~--~ h(E nt(FIE)) 
F E E F 
The two count functions n(EIE' ) and nt(FIE ) can 
be combined into one count function ng(X\[Y ) := 
n(XIY)+nt(X\[Y ) as for all words f and all words 
e and e' holds n(fle ) = 0 and nt(ele' ) = O. Using 
the function ng we arrive at the following opti- 
mization criterion: 
LP2((C,~'),ng) = - ~ h(ng(ZlX')) + 
X,X' 
~h(ng,l(X)) + Eh(ng,2(X)) (9) 
X x 
(~,~) = argmin LP2((E,~-),ng) (10) 
Here we defined ng,l(X) = ~'~x, ng(X\[X') and 
ng,2(X) = ~"~x' ng(X'\[X). The variable X runs 
over the classes in E and Y. In the optimiza- 
tion process it cannot be allowed that words of 
72 
Proceedings of EACL '99 
INPUT: Parallel corpus (e~,/~) and number of classes in 6 and ~. 
Determine the word alignment a~. 
Get some initial classes C and ~. 
UNTIL convergence criterion is met: 
FOR EACH word e: 
FOR EACH class E: 
\[Determine the change of LP((E, 9v), rig) if e is moved to E. 
Move e to the class with the largest improvement. 
FOR EACH 'word f: 
FOR EACH class F: 
~ the change of LP((C,.gv), rig) if f is moved to F. 
Move f to the class with the largest improvement. 
OUTPUT: Classes C and 5 r. 
Figure 1: Word Clustering Algorithm. 
different languages occur in one class. It can be 
seen that Eq. (3) is a special case of Eq. (9) with 
 g,1 ----- rig,2. 
Another possibility to perform bilingual word 
clustering is to apply a two-step approach. In a 
first step we determine classes £ optimizing only 
the monolingual part of Eq. (6) and secondly we 
determine classes 5~ optimizing the bilingual part 
(without changing C): 
= argm~n LP2(~,n) (11) 
.~ = argm~n LP2((E, Sr),n~). (12) 
By using these two optimization processes we en- 
force that the classes E are mono-lingually 'good' 
classes and that the classes fi- correspond to ~. 
Interestingly enough this results in a higher trans- 
lation quality (see section 5). 
4 Implementation 
An efficient optimization algorithm for LPz is the 
exchange algorithm (Martin et al., 1998). For 
the optimization of LP2 we can use the same al- 
gorithm with small modifications. Our starting 
point is a random partition of the training corpus 
vocabulary. This initial partition is improved it- 
eratively by moving a single word from one class 
to another. The algorithm to determine bilingual 
classes is depicted in Figure 1. 
If only one word w is moved between the parti- 
tions C and C' the change LP(C, ng) - LP(C', ng) 
can be computed efficiently looking only at classes 
C for which ng(w, C) > 0 or ng(C, w) > 0. We de- 
fine M0 to be the average number of seen predeces- 
sor and successor word classes. With the notation 
I for the number of iterations needed for conver- 
gence, B for the number of word bigrams, M for 
the number of classes and V for the vocabulary 
• • . , • • • . . . • 
fifty-eight ........ • 
six ........ • • frca 
......... 
Hanover " " \]" • ......... 
I• frn~ " 
hourlYtraingOesthe .... ....... • • • 'I 4" ..... 
m 13 
Figure 2: Examples of alignment templates ~. 
size the computational complexity of this algo- 
rithm is roughly I. (B. log 2 (B/V) + V. M. Mo). 
A detailed analysis of the complexity can be found 
in (Martin et al., 1998). 
The algorithm described above provides only a 
local optimum. The quality of the resulting local 
optima can be improved if we accept a short-term 
degradation of the optimization criterion during 
the optimization process. We do this in our imple- 
mentation by applying the optimization method 
threshold accepting (Dueck and Scheuer, 1990) 
which is an efficient simplification of simulated an- 
nealing. 
73 
Proceedings of EACL '99 
Table 1: The EUTRANS-I corpus. 
Train: 
Test: 
Sentences 
Words 
Vocabulary Size 
Sentences 
Words 
Bigr. Perplexity 
Spanish English 
10000 
97131 99292 
686 513 
2 996 
35023 35590 
- 5.2 
Table 2: The EUTRANS-II corpus. 
German English 
Train: 16 226 
Test: 
Sentences 
Words 
Vocabulary Size 
Sentences 
Words 
Bigr. Perplexity 
266080 299945 
39511 25751 
187 
2 556 2 853 
- 157 
5 Results 
The statistical machine-translation method de- 
scribed in (Och and Weber, 1998) makes use of 
bilingual word classes. The key element of this 
approach are the alignment templates (originally 
referred to as translation rules)which are pairs of 
phrases together with an alignment between the 
words of the phrases. Examples of alignment tem- 
plates are shown in Figure 2. The advantage of the 
alignment template approach against word-based 
statistical translation models is that word context 
and local re-orderings are explicitly taken into ac- 
count. 
The alignment templates are automatically 
trained using a parallel training corpus. The 
translation of a sentence is done by a search pro- 
cess which determines the set of alignment tem- 
plates which optimally cover the source sentence. 
The bilingual word classes are used to general- 
ize the applicability of the alignment templates in 
search. If there exists a class which contains all 
cities in source and target language it is possible 
that an alignment template containing a special 
city can be generalized to all cities. More details 
are given in (Och and Weber, 1998; Och and Ney, 
1999). 
We demonstrate results of our bilingual clus- 
tering method for two different bilingual corpora 
(see Tables 1 and 2). The EUTRANS-I corpus is 
a subtask of the "Traveller Task" (Vidal, 1997) 
which is an artificially generated Spanish-English 
corpus. The domain of the corpus is a human- 
to-human communication situation at a reception 
Table 3: Example of bilingual word classes (corpus 
EUTRANS-I, method BIL-2). 
El: how it pardon what when where which. 
who why 
E2: my our 
E3: today tomorrow 
E4: ask call make 
E5: carrying changing giving looking 
moving putting sending showing waking 
E6: full half quarter 
$1: c'omo cu'al cu'ando cu'anta d'onde 
dice dicho hace qu'e qui'en tiene 
$2: ll'eveme mi mis nuestra nuestras 
nuestro nuestros s'ub~nme 
$3: hoy manana mismo 
$4: hacerme ll'ameme ll'amenos llama 
llamar llamarme llamarnos llame p'idame 
p'idanos pedir pedirme pedirnos 
pida pide 
$5: cambiarme cambiarnos despertarme 
despertarnos llevar llevarme llevarnos 
subirme subirnos usted ustedes 
$6: completa cuarto media menos 
desk of a hotel. The EUTRANS-II corpus is a natu- 
ral German-English corpus consisting of different 
text types belonging to the domain of tourism: 
bilingual Web pages of hotels, bilingual touristic 
brochures and business correspondence. The tar- 
get language of our experiments is English. 
We compare the three described methods to 
generate bilingual word classes. The classes 
MONO are determined by monolingually opti- 
mizing source and target language classes with 
Eq. (4). The classes BIL are determined by bilin- 
gually optimizing classes with Eq. (10). The 
classes BIL-2 are determined by first optimiz- 
ing mono-lingually classes for the target language 
(English) and afterwards optimizing classes for 
the source language (Eq. (11) and Eq. (12)). 
For EUTRANS-I we used 60 classes and for 
EUTRANS-II we used 500 classes. We chose the 
number of classes in such a way that the final per- 
formance of the translation system was optimal. 
The CPU time for optimization of bilingual word 
classes on an Alpha workstation was under 20 sec- 
onds for EUTRANS-I and less than two hours for 
EUTRANS-II. 
Table 3 provides examples of bilingual word 
classes for the EUTRANS-I corpus. It can be seen 
that the resulting classes often contain words that 
are similar in their syntactic and semantic func- 
tions. The grouping of words with a different 
74 
Proceedings of EACL '99 
Table 4: Perplexity (PP) of different classes. 
Corpus MONO BIL BIL-2 
EUTRANS-I 2.13 197: 198: 
EUTRANS-II 13.2 . . 
Table 5: Average e-mirror of different classes. 
Corpus \[MONO \[BIL BIL-2 
EUTRANS-I I 3.5 I 2.6 2.6 
EUTRANS-II 2.2 1.8 2.0 
meaning like today and tomorrow does not im- 
ply that these words should be translated by the 
same Spanish word, but it does imply that the 
translations of these words are likely to be in the 
same Spanish word class. 
To measure the quality of our bilingual word 
classes we applied two different evaluation mea- 
sures: 
1. Average e-mirror size (Wang et al., 1996): 
The e-mirror of a class E is the set of classes 
which have a translation probability greater 
than e. We use e = 0.05. 
2. The perplexity of the class transition proba- 
bility on a bilingual test corpus: 
exp j-1. y~ maxi log (p (g (fj) Ig (ei))) 
j=l 
Both measures determine the extent to which the 
translation probability is spread out. A small 
value means that the translation probability is 
very focused and that the knowledge of the source 
language class provides much information about 
the target language class. 
Table 4 shows the perplexity of the obtained 
translation lexicon without word classes, with 
monolingual and with bilingual word classes. As 
expected the bilingually optimized classes (BIL, 
BIL-2) achieve a significantly lower perplexity and 
a lower average e-mirror than the mono-lingually 
optimized classes (MONO). 
The tables 6 and 7 show the translation qual- 
ity of the statistical machine translation system 
described in (Och and Weber, 1998) using no 
classes (WORD) at all, mono-lingually, and bi- 
lingually optimized word classes. The trans- 
lation system was trained using the bilingual 
training corpus without any further knowledge 
sources. Our evaluation criterion is the word er- 
ror rate (WER) -- the minimum number of in- 
Table 6: Word error rate (WER) and average 
alignment template length (AATL) on EUTRANS- 
I. 
Method 
WORD 
MONO 
BIL 
BIL-2 
WER \[70\] I AAq'L 
6.31 2.85 
5.64 5.03 
5.38 4.40 
4.76 5.19 
Table 7: Word error rate (WER) and average 
alignment template length (AATL) on EUTRANS- 
II. 
Method 
WORD 
MONO 
BIL 
BIL-2 
WER \[%\] I AATL 
64.3 1.36 
63.5 1.74 
63.2 1.53 
62.5 1.54 
sertions/deletions/substitutions relative to a ref- 
erence translation. 
As expected the translation quality improves 
using classes. For the small EuTRANS-I task the 
word error rates reduce significantly. The word er- 
ror rates for the EUTRANS-II task are much larger 
because the task has a very large vocabulary and is 
more complex. The bilingual classes show better 
results than the monolingual classes MONO. One 
explanation for the improvement in translation 
quality is that the bilingually optimized classes 
result in an increased average size of used align- 
ment templates. For example the average length 
of alignment templates with the EUTRANS-I cor- 
pus using WORD is 2.85 and using BIL-2 it is 
5.19. The longer the average alignment template 
length, the more context is used in the translation 
and therefore the translation quality is higher. 
An explanation for the superiority of BIL-2 
over BIL is that by first optimizing the English 
classes mono-lingually, it is much more probable 
that longer sequences of classes occur more often 
thereby increasing the average alignment template 
size. 
6 Summary and future works 
By applying a maximum-likelihood approach to 
the joint probability of a parallel corpus we ob- 
tained an optimization criterion for bilingual word 
classes which is very similar to the one used in 
monolingual maximum-likelihood word clustering. 
For optimization we used the exchange algorithm. 
The obtained word classes give a low translation 
lexicon perplexity and improve the quality of sta- 
75 
Proceedings of EACL '99 
tistical machine translation. 
We expect improvements in translation quality 
by allowing that words occur in more than one 
class and by performing a hierarchical clustering. 
Acknowledgements This work has been par- 
tialIy supported by the European Community un- 
der the ESPRIT project number 30268 (EuTrans). 
References 
P. F. Brown, V. J. Della Pietra, P. V. deSouza, 
J. C. Lai, and R. L. Mercer. 1992. Class-based 
n-gram models of natural language. Computa- 
tional Linguistics, 18(4):467-479. 
Peter F. Brown, Stephen A. Della Pietra, Vin- 
cent J. Della Pietra, and Robert L. Mercer. 
1993. The mathematics of statistical machine 
translation: Parameter estimation. Computa- 
tional Linguistics, 19 (2) :263-311. 
G. Dueck and T. Scheuer. 1990. Threshold ac- 
cepting: A general purpose optimization al- 
gorithm appearing superior to simulated an- 
nealing. Journal of Computational Physics, 
90(1):161-175. 
Pascale Fung and Dekai Wu. 1995. Coerced 
markov models for cross-lingual lexical-tag re- 
lations. In The Sixth International Conference 
on Theoretical and Methodological Issues in Ma- 
chine Translation, pages 240-255, Leuven, Bel- 
gium, July. 
M. Jardino and G. Adda. 1993. Automatic Word 
Classification Using Simulated Annealing. In 
Proc. Int. Conf. on Acoustics, Speech, and Sig- 
nal Processing, volume 2, pages 41-44, Min- 
neapolis. 
R. Kneser and H. Ney. 1991. Forming Word 
Classes by Statistical Clustering for Statistical 
Language Modelling. In 1. Quantitative Lin- 
guistics Conference. 
R. Kneser and H. Ney. 1993. Improved Cluster- 
ing Techniques for Class-Based Statistical Lan- 
guage Modelling. In European Conference on 
Speech Communication and Technology, pages 
973-976. 
Sven Martin, JSrg Liermann, and Hermann Ney. 
1998. Algorithms for bigram and trigram word 
clustering. Speech Communication, 24(1):19- 
37. 
Franz Josef Och and Hermann Ney. 1999. The 
alignment template approach to statistical ma- 
chine translation. To appear. 
Franz Josef Och and Hans Weber. 1998. Im- 
proving statistical natural language translation 
with categories and rules. In Proceedings of the 
35th Annual Conference of the Association for 
Computational Linguistics and the 17th Inter- 
national Conference on Computational Linguis- 
tics, pages 985-989, Montreal, Canada, August. 
Enrique Vidal. 1997. Finite-state speech-to- 
speech translation. In Proc. Int. Conf. on 
Acoustics, Speech, and Signal Processing, vol- 
ume 1, pages 111-114. 
Stephan Vogel, Hermann Ney, and Christoph Till- 
mann. 1996. HMM-based word alignment in 
statistical translation. In COLING '96: The 
16th Int. Conf. on Computational Linguistics, 
pages 836-841, Copenhagen, August. 
Ye-Yi Wang, John Laiferty, and Alex Waibel. 
1996. Word clustering with parallel spoken lan- 
guage corpora. In Proceedings of the ~th Inter- 
national Conference on Spoken Language Pro- 
cesing (ICSLP'96), pages 2364-2367. 
76 
