Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 612–617,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Discriminative Methods for Transliteration 
 
Dmitry Zelenko 
SRA International 
4300 Fair Lakes Ct. 
Fairfax VA 22033 
dmitry_zelenko@sra.com 
Chinatsu Aone 
SRA International 
4300 Fair Lakes Ct. 
Fairfax VA 22033 
chinatsu_aone@sra.com 
 
 
  
 
Abstract 
We present two discriminative methods 
for name transliteration. The methods 
correspond to local and global modeling 
approaches in modeling structured output 
spaces. Both methods do not require 
alignment of names in different lan-
guages – their features are computed di-
rectly from the names themselves. We 
perform an experimental evaluation of 
the methods for name transliteration from 
three languages (Arabic, Korean, and 
Russian) into English, and compare the 
methods experimentally to a state-of-the-
art joint probabilistic modeling approach. 
We find that the discriminative methods 
outperform probabilistic modeling, with 
the global discriminative modeling ap-
proach achieving the best performance in 
all languages.  
1 Introduction 
Name transliteration is an important task of tran-
scribing a name from alphabet to another. For 
example, an Arabic “مﺎﻴﻟو ”, Korean “윌리엄”, and 
Russian “Вильям ” all correspond to English 
“William”. We address the problem of translit-
eration in the general setting: it involves trying to 
recover original English names from their tran-
scription in a foreign language, as well as finding 
an acceptable spelling of a foreign name in Eng-
lish. 
We apply name transliteration in the context 
of cross-lingual information extraction. Name 
extractors are currently available in multiple lan-
guages. Our goal is to make the extracted names 
understandable to monolingual English speakers 
by transliterating the names into English. 
The extraction context of the transliteration 
application imposes additional complexity con-
straints on the task. In particular, we aim for the 
transliteration speed to be comparable to that of 
extraction speed. Since most current extraction 
systems are fairly fast (>1 Gb of text per hour), 
the complexity requirement reduces the range of 
techniques applicable to the transliteration. More 
precisely, we cannot use WWW and the web 
count information to hone in on the right translit-
eration candidate. Instead, all relevant translitera-
tion information has to be represented within a 
compact and self-contained transliteration model. 
We present two methods for creating and ap-
plying transliteration models. In contrast to most 
previous transliteration approaches, our models 
are discriminative. Using an existing translitera-
tion dictionary D (a set of name pairs {(f,e)}), we 
learn a function that directly maps a name f from 
one language into a name e in another language. 
We do not estimate either direct conditional 
p(e|f) or reverse conditional p(f|e) or joint p(e,f) 
probability models. Furthermore, we do away 
with the notion of alignment: our transliteration 
model does not require and is not defined of in 
terms of aligned e and f. Instead, all features 
used by the model are computed directly from 
the names f and e without any need for their 
alignment. 
The two discriminative methods that we pre-
sent correspond to local and global modeling 
paradigms for solving complex learning prob-
lems with structured output spaces. In the local 
setting, we learn linear classifiers that predict a 
letter e
i
 from the previously predicted letters 
e
1
…e
i-1
 and the original name f. In the global set-
ting, we learn a function W mapping a pair (f,e) 
into a score W(f,e)∈ R. The function W
 
is linear 
in features computed from the pair (f,e). We de-
scribe the pertinent feature spaces as well as pre-
612
sent both training and decoding algorithms for 
the local and global settings. 
We perform an experimental evaluation for 
three language pairs (transliteration from Arabic, 
Korean, and Russian into English) comparing 
our methods to a joint probabilistic modeling 
approach to transliteration, which was shown to 
deliver superior performance. We show experi-
mentally that both discriminative methods out-
perform the probabilistic approach, with global 
discriminative modeling achieving the best per-
formance in all languages. 
2 Preliminaries 
Let E and F be two finite alphabets. We will use 
lowercase latin letters e, f to denote letters e∈E, 
f∈F, and we use bold letters e∈E
*
, f∈F
*
 to de-
note strings in the corresponding alphabets. The 
subscripted e
i
, f
j
 denote ith and jth symbols of the 
strings e and f, respectively. We use e[i,j]
 
to rep-
resent a substring e
i
…e
j
 of e. If j<i, then e[i,j] is 
an empty string Λ. 
 
A transliteration model is a function mapping a 
string f to a string e. We seek to learn a translit-
eration model from a transliteration dictionary 
D={(f,e)}.  We apply the model in conjunction 
with a decoding algorithm that produces a string 
e from a string f. 
 
3 Local Transliteration Modeling 
In local transliteration modeling, we represent a 
transliteration model as a sequence of local pre-
diction problems. For each local prediction, we 
use the history h representing the context of mak-
ing a single transliteration prediction. That is, we 
predict each letter e
i
 based on the pair h=(e[1,i-
1], f) ∈ H.  
Formally, we map H×E into a d-dimensional 
feature space ϕ: H×E → R
d
, where each 
ϕ
k
(h,e)(k∈{1,..,d}) corresponds to a condition 
defined in terms of the history h and the cur-
rently predicted letter e. 
In order to model string termination, we aug-
ment E with a sentinel symbol $, and we append 
$ to each e from D.  
Given a transliteration dictionary D, we trans-
form the dictionary in a set of |E| binary learning 
problems. Each learning problem L
e
 corresponds 
to predicting a letter e∈E. More precisely, for a 
pair (f[1,m],e[1,n]) ∈ D and i ∈ {1,…,n}, we 
generate a positive example ϕ((e[1,i-1], f),e
i
) for 
the learning problem L
e
, where e=e
i
, and a nega-
tive example ϕ((e[1,i-1], f),e) for each L
e
, where 
e≠e
i
. 
Each of the learning problems is a binary clas-
sification problem and we can use our favorite 
binary classifier learning algorithm to induce a 
collection of binary classifiers {c
e
 : e∈E}. From 
most classifiers we can also obtain an estimate of 
conditional probability p(e|h) of a letter e given a 
history h. 
For decoding, in our experiments we use the 
beam search to find the sequence of letters (ap-
proximately) maximizing p(e|h).  
3.1 Local Features 
The features used in local transliteration model-
ing correspond to pairs of substrings of e and f. 
We limit the length of substrings as well as their 
relative location with respect to each other. 
• For ϕ((e[1,i-1], f),e), generate a feature 
for every pair of substrings (e[i-w,i-1],f[j-
v,j]), where 1≤w<W(E) and  0≤v<W(F) 
and |i-j| ≤ d(E,F). Here, W(·) is the upper 
bound on the length of strings in the corre-
sponding alphabet, and d(E,F) is the upper 
bound on the relative distance between 
substrings. 
• For ϕ((e[1,i-1], f[1,m]),e), generate the 
length difference feature ϕ
len
=i-m. In ex-
periments, we discretize ϕ
len
 to obtain 9 
binary features: ϕ
len
=l (l∈[-3,3]), ϕ
len 
≤ -4, 
4 ≤ ϕ
len
. 
• For ϕ((e[1,i-1], f[1,m]),e), generate a 
language modeling feature p(e| e
[1,i-1]
). 
• For ϕ((e[1,i-1], f),e) and i=1, generate 
“start” features: (^f
1
,^e), (^f
1
f
2
,^e). 
• For ϕ((e[1,i-1], f),e) and i=2, generate 
“start” features: (^f
1
,^e
1
e
2
), (^f
1
f
2
,^e
1
e
2
).  
• For ϕ((e[1,i-1], f),e) and e=$, generate 
“end” features: (f
m
$,e$), (f
m-1
f
m
$,e$). 
The parameters W(E), W(F), and d(E,F) are, in 
general, language-specific, and we will show, in 
the experiments, that different values of the pa-
rameters are appropriate for different languages. 
4 Global Transliteration Modeling 
In global transliteration modeling, we directly 
model the agreement function between f and e. 
We follow (Collins 2002) and consider the 
global feature representation Φ: F
*
×E
*
  → R
d
. 
613
Each global feature corresponds to a condition 
on the pair of strings. The value of a feature is 
the number of times the condition holds true for 
a given pair of strings. In particular, for every 
local feature ϕ
k
((e[1,i-1], f),e
i
) we can define the 
corresponding global feature: 
      )),],1,1[((),(
∑
−=Φ
i
ikk
ei feef ϕ         (1) 
We seek a transliteration model that is linear 
in the global features. Such a transliteration 
model is represented by d-dimensional weight 
vector W∈ R
d
. Given a string f, model applica-
tion corresponds to finding a string e such that  
∑
Φ=
k
kk
W ),(maxarg
e'
e'fe             (2) 
As with the case of local modeling, due to 
computational constraints, we use beam search 
for decoding in global transliteration modeling. 
(Collins 2002) showed how to use the Voted 
Perceptron algorithm for learning W, and we use 
it for learning the global transliteration model. 
We use beam search for decoding within the 
Voted Perceptron training as well. 
4.1 Global Features 
The global features used in local transliteration 
modeling directly correspond to local features 
described in Section 3.1.  
• For e[1,n] and f[1,m], generate a feature 
for every pair of substrings (e[i-w,i],f[j-
v,j]), where 1≤w<W(E) and  0≤v<W(F) 
and |i-j| ≤ d(E,F).  
• For e[1,n] and f[1,m], generate the 
length difference feature Φ
len
=n-m. In ex-
periments, we discretize Φ
len
 to obtain 9 
binary features: Φ
len
=l (l∈[-3,3]), ϕ
len 
≤ -4, 
4 ≤ ϕ
len
. 
• For e[1,n], generate a language model-
ing feature (p(e))
1/n
. 
• For e[1,n] and f[1,m],, generate “start” 
features: (^f
1
,^e
1
), (^f
1
f
2
,^e
1
), (^f
1
,^e
1
e
2
), 
(^f
1
f
2
,^e
1
e
2
).  
• For e[1,n] and f[1,m], generate “end” 
features: (f
m
$,e
n
$), (f
m-1
f
m
$,e
n
). 
5 Joint Probabilistic Modeling 
We compare the discriminative approaches to a 
joint probabilistic approach to transliteration in-
troduced in recent years. 
In the joint probabilistic modeling approach, 
we estimate a probability distribution p(e,f). We 
also postulate hidden random variables a repre-
senting the alignment of e and f. An alignment a 
of e and f is a sequence a
1
,a
2
,…a
L
, where a
l
 =  
(e[i
l
-w
l
,i
l
],f[j
l
-v
l
,j
l
]), i
l-1
+1=i
l
-w
l
, and j
l-1
+1=j
l
-v
l
. 
Note that we allow for at most one member of a 
pair a
l
 to be an empty string. 
Given an alignment a, we define the joint 
probability p(e,f|a): 
]),[],,[()|,(
l
l
lllll
jvjiwipp
∏
−−= feafe
 
We learn the probabilities p(e[i
l
-w
l
,i
l
],f[j
l
-v
l
,j
l
]) 
using a version of EM algorithm. In our experi-
ments, we use the Viterbi version of the EM al-
gorithm: starting from random alignments of all 
string pairs in D, we use maximum likelihood 
estimates of the above probabilities, which are 
then employed to induce the most probable 
alignments in terms of the probability estimates. 
The process is repeated until the probability es-
timates converge. 
During the decoding process, given a string f, 
we seek both a string e and an alignment a such 
that p(e,f|a) is maximized. In our experiments, 
we used beam search for decoding. 
Note that with joint probabilistic modeling use 
of a language model p(e) is not strictly neces-
sary. Yet we found out experimentally that an 
adaptive combination of the language model with 
the joint probabilistic model improves the trans-
literation performance. We thus combine the 
joint log-likelihood log(p(e,f|a)) with log(p(e)): 
score(e|f) = log(p(e,f|a))+ αlog(p(e))          (3) 
We estimate the parameter α on a held-out set 
by generating, for each f, the set of top K=10 
candidates with respect to log(p(e,f|a)), then us-
ing (3) for re-ranking the candidates, and picking 
α to minimize the number of transliteration er-
rors among re-ranked candidates.  
6 Experiments 
We present transliteration experiments for three 
language pairs. We consider transliteration from 
Arabic, Korean, and Russian into English. For all 
language pairs, we apply the same training and 
decoding algorithms.  
6.1 Data 
The training and testing transliteration dataset 
sizes are shown in Table 1. For Arabic and Rus-
sian, we created the dataset manually by keying 
in and translating Arabic, Russian, and English 
names. For Korean, we obtained a dataset of 
transliterated names from a Korean government 
website. The dataset contained mostly foreign 
614
names transliterated into Korean. All datasets 
were randomly split into training and (blind) test-
ing parts.  
 
 Training Testing 
Arabic 935 233 
Korean 11973 1363 
Russian 545 121 
Table 1. Transliteration Data. 
 
Prior to transliteration, the Korean words of 
the Korean transliteration data were converted 
from their Hangul (syllabic) representation to 
Jamo (letter-based) representation to effectively 
reduce the alphabet size for Korean. The conver-
sion process is completely automatic (see Uni-
code Standard 3.0 for details). 
6.2 Algorithm Details 
For language modeling, we used the list of 
100,000 most frequent names downloaded from 
the US Census website. Our language model is a  
5-gram model with interpolated Good-Turing 
smoothing (Gale and Sampson 1995). 
We used the learning-to-classify version of 
Voted Perceptron for training local models 
(Freund and Schapire 1999). We used Platt’s 
method for converting scores produced by 
learned linear classifiers into probabilities (Platt 
1999). We ran both local and global Voted Per-
ceptrons for 10 iterations during training.  
6.3 Transliteration Results 
 Our discriminative transliteration models 
have a number of parameters reflecting the 
length of strings chosen in either language as 
well as the relative distance between strings. 
While we found that choice of W(E)=W(F) = 2 
always produces the best results for all of our 
languages, the distance d(E,F) may have differ-
ent optimal values for different languages.  
Table 2 presents the transliteration results for 
all languages for different values of d. Note that 
the joint probabilistic model does not depend on 
d. The results reflect the accuracy of translitera-
tion, that is, the proportion of times when the top 
English candidate produced by a transliteration 
model agreed with the correct English translitera-
tion. We note that such an exact comparison may 
be too inflexible, for many foreign names may 
have more than one legitimate English spelling. 
In future experiments, we plan to relax the re-
quirement and consider alternative variants of 
transliteration scoring (e.g., edit distance, top-N 
candidate scoring). 
 
 Local Global Prob 
Arabic (d=1) 31.33 32.61 
Arabic (d=2) 30.04 30.04 
Arabic (d=3) 26.61 27.03 
 
25.75 
 
Korean (d=1) 26.93 30.44 
Korean (d=2) 28.84 34.26 
Korean (d=3) 30.96 35.28 
 
26.93 
 
Russian (d=1) 44.62 46.28 
Russian (d=2) 38.84 41.32 
Russian (d=3) 38.01 38.01 
 
39.67 
 
Table 2. Transliteration Results for Different  
              Values of Relative Distance (d). 
 
Table 2 shows that, for all three languages, the 
discriminative methods convincingly outperform 
the joint probabilistic approach. The global dis-
criminative approach achieves the best perform-
ance in all languages. It is interesting that differ-
ent values of relative distance are optimal for 
different languages. For example, in Korean, the 
Hangul-Jamo decomposition leads to fairly re-
dundant strings of Korean characters thereby 
making transliterated characters to be relatively 
far from each other. Therefore, Korean requires a 
larger relative distance bound. In Arabic and 
Russian, on the other hand, transliterated charac-
ters are relatively close to each other, so the dis-
tance d of 1 suffices. While for Russian such a 
small distance is to be expected, we are surprised 
by such a small relative distance for Arabic. Our 
intuition was that omitting short vowels in spell-
ing names in Arabic will increase d.  
We have the following explanation of the low 
value of d for Arabic from the machine learning 
perspective: incrementing d implies adding a lot 
of extraneous features to examples, that is, in-
creasing attribute noise. Increased attribute noise 
requires a corresponding increase in the number 
of training examples to achieve adequate per-
formance. While for Korean the number of train-
ing examples is sufficient to cope with the attrib-
ute noise, the relatively small Arabic training 
sample is not. We hypothesize that with increas-
ing the number of training examples for Arabic, 
the optimal value of d will also increase. 
7 Related Work 
Most work on name transliteration adopted a 
source-channel approach (Knight and Grael 
1998; Al-Onaizan and Knight 2002a; Virga and 
Khudanpur 2003; Oh and Choi 2000) incorporat-
615
ing phonetics as an intermediate representation. 
(Al-Onaizan and Knight 2002) showed that use 
of outside linguistic resources such as WWW 
counts of transliteration candidates can greatly 
boost transliteration accuracy. (Li et al. 2004) 
introduced the joint transliteration model whose 
variant augmented with adaptive re-ranking we 
used in our experiments. 
Among direct (non-source-channel) models, 
we note the work of (Gao et al. 2004) on apply-
ing Maximum Entropy to English-Chinese trans-
literation, and the English-Korean transliteration 
model of (Kang and Choi 2000) based on deci-
sion trees. 
All of the above models require alignment be-
tween names. We follow the recent work of 
(Klementiev and Roth 2006) who addressed the 
problem of discovery of transliterated named 
entities from comparable corpora and suggested 
that alignment may not be necessary for translit-
eration. 
Finally, our modeling approaches follow the 
recent  work on both local classifier-based mod-
eling of complex learning problems (McCallum 
et al. 2000; Punyakanok and Roth 2001), as well 
as global discriminative approaches based on 
CRFs (Lafferty et al. 2001), SVM (Taskar et al. 
2005), and the Perceptron algorithm (Collins 
2002) that we used in our experiments. 
 
8 Conclusions 
We presented two novel discriminative ap-
proaches to name transliteration that do not em-
ploy the notion of alignment. We showed ex-
perimentally that the approaches lead to superior 
experimental results in all languages, with the 
global discriminative modeling approach achiev-
ing the best performance. 
The results are somewhat surprising, for the 
notion of alignment seems very intuitive and use-
ful for transliteration. We will investigate 
whether similar alignment-free methodology can 
be extended to full-text translation. It will also be 
interesting to study the relationship between our 
discriminative alignment-free methods and re-
cently proposed discriminative alignment-based 
methods for transliteration and translation 
(Taskar et al. 2005a; Moore 2005). 
We also showed that for name transliteration, 
global discriminative modeling is superior to 
local classifier-based discriminative modeling. 
This may have resulted from poor calibration of 
scores and probabilities produced by individual 
classifiers. We plan to further investigate the re-
lationship between the local and global ap-
proaches to complex learning problems in natural 
language. 

References  
Y. Al-Onaizan and K. Knight. 2002. Translating 
Named Entities Using Monolingual and Bilingual 
Resources. Proceedings of ACL. 
Y. Al-Onaizan and K. Knight. 2002a. Machine Trans-
literation of Names in Arabic Text. Proceedings of 
ACL Workshop on Computational Approaches to 
Semitic Languages. 
M. Collins. 2002. Discriminative Training for Hidden 
Markov Models: Theory and Experiments with 
Perceptron Algorithms. In Proceedings of EMNLP. 
Y. Freund and R. Shapire. 1999. Large margin clas-
sification using the perceptron algorithm. Machine 
Learning, 37, 277–296. 
W. Gale and G. Sampson. 1995. Good-Turing fre-
quency estimation without tears. Journal of Quan-
titative Linguistics 2:217-235. 
Gao Wei, Kam-Fai Wong, and Wai Lam. 2004. Pho-
neme-based transliteration of foreign names for 
OOV problem. Proceedings of the First Interna-
tional Joint Conference on Natural Language 
Processing. 
B.J. Kang and Key-Sun Choi, 2000. Automatic Trans-
literation and Back-transliteration by Decision Tree 
Learning, Proceedings of the 2nd International 
Conference on Language Resources and Evalua-
tion. 
A. Klementiev and D. Roth. 2006. Named Entity 
Transliteration and Discovery from Multilingual 
Comparable Corpora. Proceedings of ACL. 
K. Knight and J. Graehl. 1998. Machine Translitera-
tion, Computational Linguistics, 24(4). 
J. Lafferty, A. McCallum, and F. Pereira. 2001. Con-
ditional Random Fields: Probabilistic Models for 
Segmenting and Labeling Sequence Data. Proceed-
ings of the Eighteenth International Conference on 
Machine Learning. 
Li Haizhou, Zhang Min, and Su Jian. 2004. A Joint 
Source-channel Model for Machine Transliteration. 
Proceedings of ACL 2004. 
A. McCallum, D. Freitag, and F. Pereira. 2000. Maxi-
mum entropy Markov models for information ex-
traction and segmentation. Proceedings of ICML. 
R. Moore. 2005. A Discriminative Framework for 
Bilingual Word Alignment. Proceedings of the 
Conference on Empirical Methods in Natural Lan-
guage Processing. 
Jong-Hoon Oh and Key-Sun Choi. 2000. An English-
Korean Transliteration Model Using Pronunciation 
and Contextual Rules. Proceedings of COLING. 
J. Platt. 1999. Probabilistic outputs for support vector 
machines and comparison to regularized likelihood 
methods. In Advances in Large Margin Classi£ers. 
V. Punyakanok and D. Roth. 2001. The Use of Classi-
fiers in Sequential Inference. Proceedings of the 
Conference on Advances in Neural Information 
Processing Systems. 
B. Taskar, V. Chatalbashev, D. Koller and C. Gues-
trin. 2005. Learning Structured Prediction Models: 
A Large Margin Approach. Proceedings of Twenty 
Second International Conference on Machine 
Learning. 
B. Taskar, S. Lacoste-Julien, and D. Klein. 2005a. A 
Discriminative Matching Approach to Word Align-
ment. Proceedings of the Conference on Empirical 
Methods in Natural Language Processing. 
P. Virga and S. Khudanpur. 2003. Transliteration of 
Proper Names in Cross-lingual Information Re-
trieval. Proceedings of ACL 2003 workshop 
MLNER. 
