English-to-Korean Transliteration using Multiple Unbounded 
Overlapping Phoneme Chunks 
In-Ho Kang and GilChang Kim 
Department of Computer Science 
Korea Advanced Institute of Science and Technology 
Abstract 
We present in this paper the method of 
English-to-Korean(E-K) transliteration and 
back-transliteration. In Korean technical 
documents, many English words are translit- 
erated into Korean words in w~rious tbrms in 
diverse ways. As English words and Korean 
transliterations are usually technical terms and 
proper nouns, it; is hard to find a transliteration 
and its variations in a dictionary. Theretbre 
an automatic transliteration system is needed 
to find the transliterations of English words 
without manual intervention. 
3?.o explain E-K transliteration t)henomena, 
we use phoneme chunks that do not have a 
length limit. By aI)plying phoneme chunks, 
we combine different length intbrmation with 
easy. The E-K transliteration method h~m 
three stet)s. In the first, we make a t)honeme 
network that shows all possible transliterations 
of the given word. In the second step, we apply 
t)honeme chunks, extracted fl'om training data, 
to calculate the reliability of each possible 
transliteration. Then we obtain probable 
transliterations of the given English word. 
1 Introduction 
In Korean technical documents, many English 
words are used in their original forms. But 
sometimes they are transliterated into Korean 
in different forms. Ex. 1, 2 show the examples 
of w~rious transliterations in KTSET 2.0(Park 
et al., 1996). 
(1) data 
(a) l:~\] o\] >\]-(teyitha) \[1,033\] 1 
(b) r~l o\] >\] (teyithe) \[527\] 
1the frequency in KTSET 
(2) digital 
(a) ~\] x\] ~-(ticithul) \[254\] 
(b) qN~-(tichithM)\[7\] 
(c) ~1 z\] ~ (ticithel) \[6\] 
These various transliterations are not negligi- 
ble tbr natural langnage processing, especially ill 
information retrieval. Because same words are 
treated as different ones, the calculation based 
(m tile frequency of word would produce mis- 
leading results. An experiment shows that the 
effectiveness of infbrmation retrieval increases 
when various tbrms including English words are 
treated eqnivMently(Jeong et al., 1997). 
We may use a dictionary, to tind a correct 
transliteratkm and its variations. But it is not 
fhasible because transliterated words are usually 
technical terms and proper nouns that have rich 
productivity. Therefore an automatic translit- 
eration system is needed to find transliterations 
without manual intervention. 
There have been some studies on E-K 
transliteration. They tried to explain translit- 
eration as tflloneme-lmr-phoneme or alphabet- 
per-phonenm classification problem. They re- 
stricted the information length to two or three 
units beibre and behind an input unit. In tact, 
ninny linguistic phenomena involved in the E-K 
transliteration are expressed in terms of units 
that exceed a phoneme and an alphabet. For 
example, 'a' in 'ace' is transliterated into %11 
°l (ey0" lint in 'acetic', "ot (eft and ill 'acetone', 
"O}(a)". If we restrict the information length 
to two alphabets, then we cmmot explain these 
phenomena. Three words ge~ the same result 
~()r ~ a'. 
(3) ace otl o\] ~(eyisu) 
(4) acetic cq.q v I (esithik) 
418 
(5) acetone ol-J~ll ~-(aseython) 
In this t)at)er, we t)rot)ose /;he E-K transliter- 
al;ion model t)ased on l)honeme chunks that 
do not have a length limit and can explain 
transliter;~tion l)henolnem, in SOllle degree of 
reliability. Not a alphal)et-per-all)habet but a 
chunk-i)er-chunk classification 1)roblem. 
This paper is organized as tbllows. In section 
2, we survey an E-K transliteration. \]111 section 
3, we propose, phonenm chunks 1)asexl translit- 
eration and back-transliteration. In Seel;ion 4, 
the lesults of ext)erilnents are presented. Fi- 
nally, the con(:hlsion follows in section 5. 
2 English-to-Korean transliteration 
E-K transliteration models are (:lassitied in two 
methods: the l)ivot method and the direct 
method. In the pivot method, transliteration 
is done in two steps: (:onverting English words 
ill|;() pronunciation symbols and then (:onverting 
these symbols into Kore~m wor(ts by using the 
Korean stm~(tard conversion rule. In the direct 
method, English words are directly converted to 
Korean words without interlnediate stct)s. An 
exl)eriment shows that the direct method is bet- 
ter than the pivot method in tin(ling wtriations 
of a transliteration(Lee and (~hoi, 1998). Statis- 
ti(:al information, neural network and de(:ision 
tree were used to imt)lelneld; the direct method. 
2.1 Statistieal Transliteration method 
An English word is divided into phoneme se- 
quence or alphal)et sequence as (21~(22~... ~e n. 
Then a corresponding Korean word is rel)- 
resented as kl, k2,... , t~:n. If n correspond- 
ing Korean character (hi) does not exist, we 
fill the blank with '-'. For example, an En- 
glish word "dressing" and a Korean word "> 
N\] zg (tuleysing)" are represented as Fig. 1. The 
ut)per one in Fig. 1 is divided into an English 
phoneme refit and the lower one is divided into 
an alphabet mlit. 
dressh, g :---~ ~1~ 
d/~+r/a +e/ 4l + ss/ x +i/ I +n-g/ o 
l d/=---+r/e +e/41 +s/~.+s/-+i/ I +n/o +g/- 
Figure 1: An E-K transliteration exault)le 
The t)roblem in statistical transliteration 
reel;hod is to lind out the. lllOSt probable translit- 
eration fbr a given word. Let p(K) be the 1)tel)- 
ability of a Korean word K, then, for a given 
English word E, the transliteration probal)ility 
of a word K Call be written as P(KIE). By using 
the Bayes' theorem, we can rewrite the translit- 
eration i)rol)lem as follows: 
,a'.q maz p(KIE) = a,..q ma:,: p(K)p(~IK) (1) 
K K 
With the Markov Independent Assulni)tion , 
we apl)roximate p(K) and p(EIK) as tbllows: 
7~ 
i=2 
i=1 
As we do not know the t)rommciation of a 
given word, we consider all possible tfllonelne 
sequences, l?or exanlple, 'data' has tbllowing 
possible t)holmme sequences, 'd-a-t-a, d-at-a, 
da-ta, ...'. 
As the history length is lengthened, we. can 
get more discrimination. But long history in- 
fornlation c~mses a data sl)arseness prol)lenl. In 
order to solve, a Sl)arseness t)rol)len~, Ma.ximmn 
Entropy Model, Back-off, and Linear intert)ola- 
tion methods are used. They combine different 
st~tistical estimators. (Tae-il Kim, 2000) use u t) 
to five phonemes in feature finlction(Berger et 
a,l., 1996). Nine %ature flmctions are combined 
with Maximum Entrot)y Method. 
2.2 Neural Network and Decision Tree 
Methods based 011 neural network and decision 
tree detenninistically decide a Korean charac- 
ter for a given English input. These methods 
take two or three alphabets or t)honemes as 
an input and generate a Korean alphabet 
or phoneme as an output. (Jung-.\]ae Kim, 
1.999) proposed a neural network method that 
uses two surrom~ding t)holmmes as an intmt. 
(Kang, 1999) t)roposed a decision tree method 
that uses six surrounding alphabets. If all 
inl)ut does not cover the phenomena of prol)er 
transliterations, we cammt gel; a correct answer. 
419 
Even though we use combining methods to 
solve the data sparseness problem, the increase 
of an intbrmation length would double the 
complexity and the time cost of a problem. It 
is not easy to increase the intbrmation length. 
To avoid these difficulties, previous studies 
does not use previous outputs(ki_z). But it 
loses good information of target . 
Our proposed method is based on the direct 
method to extract the transliteration and its 
variations. Unlike other methods that deter- 
mine a certain input unit's output with history 
information, we increase the reliability of a cer- 
tain transliteration, with known E-K transliter- 
ation t)henonmna (phoneme chunks). 
3 Transliteration using Multiple 
unbounded overlapping phoneme 
chunks 
For unknown data, we can estimate a Korean 
transliteration ti'onl hand-written rules. We 
can also predict a Korean transliteration with 
experimental intbrmation. With known English 
and Korean transliteration pairs, we can as- 
sume possible transliterations without linguistic 
knowledge. For example, 'scalar" has common 
part with 'scalc:~sqlN (suhhcyil)', ' casinoJ\[ 
xl  (t:hacino)', 't:oala:   e-l-&:hoalla)', and 
'car:~l-(kh.a)' (Fig. 2). We can assume possible 
transliteration with these words and their 
transliterations. From 'scale' and its transliter- 
ation l'-~\] ~ (sukheyil), the 'sc' in 'scalar' can be 
transliterated as '~:-J(sukh)'. From a 'casino' 
example, the 'c' has nlore evidence that can be 
transliterated as 'v (kh)'. We assume that we 
can get a correct Korean transliteration, if we 
get useful experinlental information and their 
proper weight that represents reliability. 
3.1 The alignment of an English word 
with a Korean word 
We can align an English word with its translit- 
eration in alphabet unit or in phoneme unit. 
Korean vowels are usually aligned with English 
vowels and Korean consonants are aligned with 
English consonants. For example, a Korean 
consonant, '1~ (p)' can be aligned with English 
consonants 'b', 'p', and 'v'. With this heuristic 
we can align an English word with its translit- 
eration in an alphabet unit and a t)honeIne unit 
with the accuracy of 99.4%(Kang, 1999). 
s c a 1 a r 
s c a 1 e 
k o a 
1 n o 
I L ± 
C a r 
F 
Figure 2: the transliteration of 'scalar : ~ 
~\]-(sukhalla)' 
3.2 Extraction of Phoneme Chunks 
From aligned training data, we extract phoneme 
clumks. We emmw.rate all possible subsets of 
the given English-Korean aligned pair. During 
enumerating subsets, we add start and end posi- 
tion infbrmation. From an aligned data "dress- 
ing" and "~etl N (tuleysing)", we can get subsets 
as Table 12. 
Table 1: The extraction of phoneme chunks 
Context Output 
d 
r. # 
¢,_)dr'c d/=--(d)+r/~- (r')+e/ql (ey) 
The context stands tbr a given English al- 
phabets, and the output stands for its translit- 
eration. We assign a proper weight to each 
phoneme chunk with Equation 4. 
C ( output ) wei.qh, t(contcxt : output) - C(contcxt) (4) 
C(x) means tile frequency of z in training data. 
Equation 4 shows that the ambiguous phe- 
nomenon gets the less evidence. The clnmk 
weight is transmitted to each phoneme symbol. 
To compensate for the length of phoneme, we 
multiply the length of phoneme to the weight of 
the phoneme chunk(Fig. 3). 
2@ means the start and end position of a word 
420 
weight(surfing: s/Z- + ur/4 + if= + i/l + rig/o) = (Z 
4, 4, 4, 4, 4, 
o~ 2a a a 2o~ 
\]?igure 3: The weight of a clmnk and a t)honeme 
This chunk weight does not mean the. relia- 
t)ility of a given transliteration i)henomenon. 
We know real reliM)itity, after all overlapping 
phonenm chunks are applied. The chunk that 
has some common part with other chunks 
gives a context information to them. Therefore 
a chunk is not only an int)ut unit but also 
a means to (-Mculate the reliability of other 
dmnks. 
We also e, xl;ra(:t the connection information. 
From Migned training (b:~ta, we obtain M1 pos- 
sible combinations of Korem~ characters and 
English chara(:ters. With this commction in- 
tbrmation, we exclude iml)ossit)h; connections 
of Korean characters ~md English phon(;nte se- 
quences. We can gel; t;he following (:ommction 
information from "dressing" examph'.(~12fl)le 2). 
2?fl)le 2: Conne(:tion Information 
Enffli.sh, Kore.a',. 1 
\]lql't\[righ, t II Z(¢lt l ,.,:.,1,,t / 
a ,. (,9 
, ('. 09 
3.3 A Transliteration Network 
For a given word, we get all t)ossil)h~ t)honemes 
and make a Korean transliteration network. 
Each node in a net;work has an English t)honent(; 
and a ('orrcspondillg Korean character. Nodes 
are comm(:ted with sequence order. For exam- 
ple, 'scalar' has the Kore, an transliteration net- 
work as Fig. 4. In this network, we dis('ommct 
some no(les with extracted (:onne('tion infornla- 
tion. 
After drawing the Korean tr~msliteration net- 
work, we apply all possible phone, me, chunks 
to the. network. Each node increases its own 
weight with the weight of t)honeme symbol in a 
phoneme chunks (Fig. 5). By overlapping the 
weight, nodes in the longer clmnks get; more ev- 
idence. Then we get the best t)ath that has the 
Figure 4: Korean Transliteration Network for 
'scalar' 
highest sum of weights, with the Viterbi algo- 
ril, hm. The Tree.-Trcllis Mgorithm is used to gel; 
the variations(Soong and Huang, 1991). 
Figure 5: Weight aptflication examt)le 
4 E-K back-transliteration 
E-K back transliteration is a more difficult prot)- 
lem thtnt F,-K trmlsliteration. During the E-K 
trm~slit;cra|;ion~ (lifli'xent alphabets are treated 
cquiw~h'.ntly. \],~)r exmnph'., ~f, t / mM ~v~ b' 
spectively and the long sound and the short 
strand are also treated equivalently. Therefim', 
the number of possible English phone, rues per 
a Korean character is bigger than the number 
of Korean characters per an English phoneme. 
The ambiguity is increased. In E-K back- 
transliteration, Korean 1)honemes and English 
phoneme, s switch their roles. Just switching the 
position. A Korean word ix Migned with an 
English word in a phoneme unit or a character 
refit (Fig. 6). 
\[ ~---~l~ : dressing\] 
F/d+--/-+ ~/r+ 41/e+~-/ss+ I /i+o/n 9 ,~ 
Figure 6: E-K back-transliteration examt)le 
421 
5 Experiments 
Experiments were done in two points of view: 
the accuracy test and the variation coverage 
test. 
5.1 Test Sets 
We use two data sets for an accuracy test. Test 
Set I is consists of 1.,650 English and Korean 
word pairs that aligned in a phoneme unit. It 
was made by (Lee and Choi, 1998) and tested by 
many methods. To compare our method with 
other methods, we use this data set. We use 
same training data (1,500 words) and test data 
(150 words). Test Set II is consists of 7,185 
English and Korean word paii's. We use Test 
Set H to show the relation between the size of 
training data and the accuracy. We use 90% 
of total size as training data and 10% as test 
data. For a variation coverage test, we use Test 
Set III that is extracted from KTSET 2.0. Test 
Set HI is consists of 2,391 English words and 
their transliterations. An English word has 1.14 
various transliterations in average. 
5.2 Evaluation functions 
Accuracy was measured by the percentage of 
the number of correct transliterations divided 
by the number of generated transliterations. We 
(:all it as word accuracy(W.A.). We use one 
more measure, called character accuracy(C.A.) 
that measures the character edit distance be- 
tween a correct word and a generated word. 
no. of correct words 
W.A. = no. o.f .qenerated words (5) 
C.A. = L (6) 
where L is the length of the original string, and 
i, d, mid s are the number of insertion, deletion 
and substitution respectively. If the dividend is 
negative (when L < (i + d + s)), we consider it 
as zero(Hall and Dowling, 1980). 
For the real usage test, we used variation cov- 
erage (V.C.) that considers various usages. We 
evaluated both tbr the term frequency (tf) and 
document frequency (d J), where tfis the number 
of term appearance in the documents and df is 
the number of documents that contain the term. 
If we set the usage tf (or d./) of the translitera- 
tions to 1 tbr each transliteration, we can calcu- 
late the transliteration coverage tbr the unique 
word types, single .frequency(.sf). 
V.C. = {if, df, s f} of found words (7) 
{t.f, 4f, <f} of  ,sed   o,'ds 
5.3 Accuracy tests 
We compare our result \[PCa, PUp\] a with the 
simple statistical intbrmation based model(Lee 
and Choi, 1998) \[ST\], the Maxinmm Entropy 
based model(Tae-il Kim, 2000) \[MEM\], the 
Neural Network model(Jung-Jae Kim, 1999) 
INN\] and the Decision %'ee based model(Kang, 
1999)\[DT\]. Table 3 shows the result of E- 
K transliteration and back-transliteration test 
with Test ,get L 
Table 3: C.A. and W.A. with Test Set I 
E-K trans. 
method C.A. I W.A. 
ST 69.3% 40.7% 4 
MEM 72.3% 43.3% 
NN 79.0% 35.1% 
DT 78.1% 37.6% 
Pep 86.5% 55.3% 
PCa 85.3% 46.7% 
E-K back trans. 
C.A. \[ W.A. 
60.5% 
77.1% 31.0% 
81.4% 34.7% 
79.3% 32.6% 
95 
85 
75 
65 
55 
45 
35 
Fig. 7, 8 show the results of our proposed 
method with the size of training data, Test Set 
II. We compare our result with the decision tree 
based method. 
~----~ I~C'A'PC° I 
i--~- w.A DT I 
+ W.A. POp 
~ W,A. BCa 
J J J ~ J i 
1000 2000 3000 4000 5000 6000 
Figure 7: E-K transliteration results with Test 
Set H 
aPC stands for phoneme chunks based method and 
a and b stands for aligned by an alphabet unit and a 
1)honeme unit respectively 
4with 20 higher rank results 
422 
90 
80- 
70 
6O 
5O 
40 
30 i 
20 t 
1000 2000 3000 4000 5000 6000 
"--¢-- C.A, DT 
---U-- C.A. PCp 
C.A. PCa I !--x- ~A. Dr i 
I.__ 144A. POp\] 
I.~l~ W,A. POa 
Figure 8: E-K back-transliteration results with 
Test Set H 
With 7'c,s't Sc, t H~ we (:m~ get t;15(; fi)llowing 
result (Table, 4). 
Table d: C.A. and W.A. with the Test Set H 
E-K tr~ms. E-K back tr~ms. 
method C.A. 14LA. C.A. I W.A. 
PUp \[~9.5% 57.2% 84.9% 40.9% 
PCa \[19o.6% 58.3% s4.8% 4(/.8% 
5.4 Variation coverage tests 
To (:oml)~re our result(PCp) with (Lee and 
()hoi, 1998), we tr~fincd our lnethods with the 
training data of Test Set L In ST, (Lee mid 
Choi, 1998) use 20 high rank results, but we 
j tlst llSe 5 results. TM)le 5 shows the (:overage 
of ore: i)rol)osed me.thod. 
Table 5: w~riation eover~ge with Tc.~t Set III 
method tf d.f ,~f 
ST 76.0% 73.9% 47.1% 
PCp 84.0% 84.0% 64.0% 
Fig. 9 shows the increase of (:overage with the 
number of outputs. 
5.5 Discussion 
We summarize the, information length ~md the 
kind of infonnation(Tnble 6). The results of 
experimenLs and information usage show theft 
MEM combines w~rious informal;ion better 
than DT and NN. ST does not list & previous 
inlmt (el-l) but use ~ previous output(t,:i_~) to 
calculate the current outlml?s probability like 
95 
90 
85 
80 
70 
65 
60 
55 
5O 
45 
---,I !-m-elf 
, , , ~ sf 
1 2 3 4 5 6 7 8 9 10 
Figure 9: The 17. C. result 
q~,l)le 6: Intbrmation Usage 
previous output 
ST 2 0 Y 
MEM 2 2 N 
NN \] 1 N 
DT 3 3 N 
PC Y 
Part-ofSt)eeeh rl'~gging probleln. But ST gets 
the lowest aecm'acy. It means that surrmmding 
alphal)ei;s give more informed;ion than t)revious 
outlmL. In other words, E-K trmlslii;e.ration is 
not the all)h~bet-per-alphabet or phonenle-per- 
t)honeme (:lassific~tion problem. A previous 
outI)ut does not give, enough information for 
cllrrent ltnit's dismnbiguat;ion. An input mill 
mid an OUtlmt unit shouht be exl:ende(t. E-K 
transliteration is a (:hunk-l)er-chunk classifica- 
tion prot)lenL 
We restri(:t the length of infiwm~tion, to see 
the influence of' phoneme-chunk size. Pig. 10 
shows the results. 
i 9oi ~~~_~",~ F 
70 
6oi 
5040 /~# .~ 
-- -- C.A. 7bst Sot f 30 / / I-~- c.A. z~.~ so,, I 
20 t~ / I~WA To,.t Set / i 
I o -×- ~i L;, L,, ! 
x ~ 
0 
1 2 3 4 5 6 7 
Figure 10: the result of ~ length limit test 
423 
With the same length of information, we 
get the higher C.A. and W.A. than other 
methods. It means previous outputs give good 
information and our chunk-based nmthod is 
a good combining method. It also suggests 
that we can restrict the max size of chunk in a 
permissible size. 
PCa gets a higher accuracy than PCp. It 
is clue to the number of possible phoneme se- 
quences. A transliteration network that con- 
sists of phoneme nnit has more nodes than a 
transliteration network that consists of alpha- 
bet unit. With small training data, despite of 
the loss due to the phoneme sequences ambi- 
guity a phoneme gives more intbrmation than 
an alphabet. When the infbrmation is enough, 
PCa outpertbrms Pep. 
6 Conclusions 
We propose the method of English-to-Korean 
transliteration and back-transliteration with 
multiple mfl)ounded overlapping phoneme 
chunks. We showed that E-K transliteration 
and back-transliteration are not a t)honeme- 
per-phoneme and alphabet-per-alphabet 
classification problem. So we use phoneme 
chunks that do not have a length limit and 
can explain E-K transliteration phenomena. 
We get the reliability of a given transliter- 
ation phenomenon by applying overlapt)ing 
phoneme chunks. Our method is simple and 
does not need a complex combining method 
tbr w, rious length of information. The change 
of an intbrmation length does not affect the 
internal representation of the problem. Our 
chunk-based method can be used to other 
classification problems and can give a simple 
combining method. 

References 

Tae-il Kim. 2000. English to Korean translit- 
eration model using maxinmm entropy model 
for cross  information retrieval. Mas- 
ter's thesis, Seogang University (in Korean). 

Kil Soon aeong, Sllng Hyun Myaeng, Jae Sung 
Lee, and Key-Sun Choi. 1999. Automatic 
identification and back-transliteration of for- 
eign words tbr information retrieval, b~:for- 
mation Processing and Management. 

Key-Sun Choi Jung-Jae Kim, Jae Sung Lee. 
1999. Pronunciation unit based automatic 
English-Korean transliteration model using 
neural network. In Pwceedings of Korea Cog- 
nitive Science Association(in Korean). 

Byung-Ju Kang. 1999. Automatic Korean- 
English back-transliteration. In Pwecedings 
of th, c 11th, Conference on Iiangul and Ko- 
rean Language Information Prvcessing( in Ko- Fean). 

Jae Sung Lee and Key-Sun Choi. 1998. English 
to Korean statistical transliteration for in- 
formation retrieval. Computer PTvcessin9 of 
Oriental Languages. 

K. Jeong, Y. Kwon, and S. H. Myaeng. 1997. 
The effect of a proper handling of foreign and 
English words in retrieving Korean text. In 
Proceedings of the 2nd Irdernational Work- 
shop on lrtforrnation Retrieval with Asian 
Languages. 

K. Knight and J. Graehl. 1997. Machine 
transliteration. In Proceedings o.f the 35th 
Annual Meeting of the Association J'or Com- 
putational Linguistics. 

Adam L. Berger, Stephen A. Della Pietra, and 
Vincent J. Della Pietra. 1996. A maximum 
entroI)y approach to natural  pro- 
cessing. Computational Linguistics. 

Y. C. Park, K. Choi, J. Kim, and Y. Kim. 
1996. Development of the data collection ver. 
2.0ktset2.0 tbr Korean intbrmation retrieval 
studies. In Artificial Intelligence Spring Con- 
ference. Korea Intbrmation Science Society (in Korean). 

Frank K. Soong and Eng-Fong Huang. 1991. 
A tree-trellis based Nst search for tinding 
the n best sentence hypotheses in eontimmus 
speech recognition. In IEEE International 
Conference on Acoustic Speech and Signal 
Pwcessing, pages 546-549. 

P. Hall and G. Dowling. 1980. Approximate 
string matching. Computing Surveys. 
