1 
Translating Names and Technical Terms in Arabic Text 
Bonnie Glover Stalls and Kevin Knight 
USC Information Sciences Institute 
Marina del Rey, CA 90292 
bgsQis i. edu, knigh~;~isi, edu 
Abstract 
It is challenging to translate names and technical terms from English into Arabic. Translation is 
usually done phonetically: different alphabets and sound inventories force various compromises. 
For example, Peter Streams may come out as hr..~ ~ bytr szrymz. This process is called 
transliteration. We address here the reverse problem: given a foreign name or loanword in Arabic 
text, we want to recover the original in Roman script. For example, an input like .~..A~ 
bytr strymz should yield an output like Peter Streams. Arabic presents special challenges due 
to unwritten vowels and phonetic-context effects. We present results and examples of use in an 
Arabic-to-English machine translator. 
Introduction It is not trivial to write an algorithm for turning 
Translators must deal with many problems, and 
one of the most frequent is translating proper 
names and technical terms. For language pairs 
like Spanish/English, this presents no great chal- 
lenge: a phrase like Antonio Gil usually gets trans- 
lated as Antonio Gil. However, the situation is 
more complicated for language pairs that employ 
very different alphabets and sound systems, such 
as Japanese/English and Arabic/English. Phonetic 
translation across these pairs is called translitera- 
tion. 
(Knight and Graehl, 1997) present a computa- 
tional treatment of Japanese/English translitera- 
tion, which we adapt here to the case in Arabic. 
Arabic text, like Japanese, frequently contains for- 
eign names and technical terms that are translated 
phonetically. Here are some examples from newspa- 
per text: a 
Jim Leighton 
oA 
(j ym 1 ! ytwn) 
Wall Street 
(wwl stryt) 
Apache helicopter 
(hlykwbtr ! b!tsby) 
IThe romanization of Arabic orthography used here 
consists of the following consonants: ! (alif), b, 
t, th, j, H, x, d, dh, r, z, s, sh, S, D, T, Z, 
G (@ayn), G (Gayn), f, q, k, 1, m, n, =h, w, y, ' 
(hamza). !, w, and y also indicate long vowels. !' 
and !+ indicate harnza over ali/and harnza under ali/, 
respectively. 
English letter sequences into Arabic letter sequences, 
and indeed, two human translators will often pro- 
duce different Arabic versions of the same English 
phrase. There are many complexity-inducing fac- 
tors. Some English vowels are dropped in Arabic 
writing (but not all). Arabic and English vowel in- 
ventories are also quite different--Arabic has three 
vowel qualities (a, i, u) each of which has short 
and long variants, plus two diphthongs (ay, aw), 
whereas English has a much larger inventory of as 
many as fifteen vowels and no length contrast. Con- 
sonants like English D are sometimes dropped. An 
English S sound frequently turns into an Arabic s, 
but sometimes into z. English P and B collapse into 
Arabic b; F and V also collapse to f. Several En- 
glish consonants have more than one possible Arabic 
rendering--K may be Arabic k or q, t may be Ara- 
bic t or T (T is pharyngealized t, a separate letter 
in Arabic). Human translators accomplish this task 
with relative ease, however, and spelling variations 
are for the most part acceptable. 
In this paper, we will be concerned with a more 
difficult problem--given an Arabic name or term 
that has been translated from a foreign language, 
what is the transliteration source? This task chal- 
lenges even good human translators: 
?  jj.cu 
(m'yk m!kwry) 
? 
? 
( !ntrnt !ksblwrr) 
(Answers appear later in this paper). 
34 
Among other things, a human or machine transla- 
tor must imagine sequences of dropped English vow- 
els and must keep an open mind about Arabic letters 
like b and f. We call this task back-transliteration. 
Automating it has great practical importance in 
Arabic-to-English machine translation, as borrowed 
terms are the largest source of text phrases that do 
not appear in bilingual dictionaries. Even if an En- 
glish term is listed, all of its possible Arabic variants 
typically are not. Automation is also important for 
machine-assisted translation, in which the computer 
may suggest several translations that a human trans- 
lator has not imagined. 
2 Previous Work 
(Arbabi et al., 1994) developed an algorithm at IBM 
for the automatic forward transliteration of Arabic 
personal names into the Roman alphabet. Using a 
hybrid neural network and knowledge-based system 
approach, this program first inserts the appropriate 
missing vowels into the Arabic name, then converts 
the name into a phonetic representation, and maps 
this representation into one or more possible Roman 
spellings of the name. The Roman spellings may also 
vary across languages (Sharifin English corresponds 
to Chgrife in French). However, they do not deal 
with back-transliteration. 
(Knight and Graehl, 1997) describe a back- 
transliteration system for Japanese. It comprises a 
generative model of how an English phrase becomes 
Japanese: 
1. An English phrase is written. 
2. A translator pronounces it in English. 
3. The pronunciation is modified to 
Japanese sound inventory. 
fit the 
4. The sounds are converted into the Japanese 
katakana alphabet. 
5. Katakana is written. 
They build statistical models for each of these five 
processes. A given model describes a mapping be- 
tween sequences of type A and sequences of type B. 
The model assigns a numerical score to any particu- 
lar sequence pair a and b, also called the probability 
of b given a, or P(b\]a). The result is a bidirectional 
translator: given a particular Japanese string, they 
compute the n most likely English translations. 
Fortunately, there are techniques for coordinating 
solutions to sub-problems like the five above, and 
for using generative models in the reverse direction. 
These techniques rely on probabilities and Bayes' 
Rule. 
For a rough idea of how this works, suppose we 
built an English phrase generator that produces 
word sequences according to some probability dis- 
tribution P(w). And suppose we built an English 
pronouncer that takes a word sequence and assigns 
it a set of pronunciations, again probabilistically, ac- 
cording to some P(elw ). Given a pronunciation e, 
we may want to search for the word sequence w that 
maximizes P(w\[e). Bayes' Rule lets us equivalently 
maximize P(w) • P(e\]w), exactly the two distribu- 
tions just modeled. 
Extending this notion, (Knight and Graehl, 1997) 
built five probability distributions: 
1. P(w) - generates written English word se- 
quences. 
2. P(e\]w) - pronounces English word sequences. 
3. P(jle) - converts English sounds into Japanese 
sounds. 
4. P(klj ) - converts Japanese sounds to katakana 
writing. 
5. P(o\[k) - introduces misspellings caused by op- 
tical character recognition (OCR). 
Given a Japanese string o they can find the En- 
glish word sequence w that maximizes the sum over 
all e, j, and k, of 
P(w) • P(elw) • P(jle) • P(klj) • P(olk) 
These models were constructed automatically 
from data like text corpora and dictionaries. The 
most interesting model is P(jle), which turns En- 
glish sound sequences into Japanese sound se- 
quences, e.g., S AH K ER (soccer) into s a kk a a. 
Following (Pereira and Riley, 1997), P(w) is 
implemented in a weighted finite-state acceptor 
(WFSA) and the other distributions in weighted 
finite-state transducers (WFSTs). A WFSA is a 
state/transition diagram with we.ights and symbols 
on the transitions, making some output sequences 
more likely than others. A WFST is a WFSA with 
a pair of symbols on each transition, one input and 
one output. Inputs and outputs may include the 
empty string. Also following (Pereira and Riley, 
1997), there is a general composition algorithm for 
constructing an integrated model P(xlz ) from mod- 
els P(x\]y) and P(y\]z). They use this to combine an 
observed Japanese string with each of the models in 
turn. The result is a large WFSA containing all pos- 
sible English translations, the best of which can be 
extracted by graph-search algorithms. 
35 
3 Adapting to Arabic 
There are many interesting differences between Ara- 
bic and Japanese transliteration. One is that 
Japanese uses a special alphabet for borrowed for- 
eign names and borrowed terms. With Arabic, 
there are no such obvious clues, and it is diffi- 
cult to determine even whether to attempt a back- 
transliteration, to say nothing of computing an accu- 
rate one. We will not address this problem directly 
here, but we will try to avoid inappropriate translit- 
erations. While the Japanese system is robust-- 
everything gets some transliteration--we will build a 
deliberately more brittle Arabic system, whose fail- 
ure signals that transliteration may not be the cor- 
rect option. 
While Japanese borrows almost exclusively from 
English, Arabic borrows from a wider variety of lan- 
guages, including many European ones. Fortunately, 
our pronunciation dictionary includes many non- 
English names, but we should expect to fail more 
often on transliterations from, say, French or Rus- 
sian. 
Japanese katakana writing seems perfectly pho- 
netic, but there is actually some uncertainty in how 
phonetic sequences are rendered orthographically. 
Arabic is even less deterministically phonetic; short 
vowels do not usually appear in written text. Long 
vowels, which are normally written in Arabic, often 
but not always correspond to English stressed vow- 
els; they are also sometimes inserted in foreign words 
to help disambiguate pronunciation. Because true 
pronunciation is hidden, we should expect that it 
will be harder to establish phonetic correspondences 
between English and Arabic. 
Japanese and Arabic have similar consonant- 
conflation problems. A Japanese r sound may have 
an English r or 1 source, while an Arabic b may come 
from p or b. This is what makes back-transliteration 
hard. However, a striking difference is that while 
Japanese writing adds extra vowels, Arabic writing 
deletes vowels. For example: 2 
Hendette --~ H Ell N R IY EH T (English) 
-~t h e n o r i ett o (Japanese) 
=h n r y t (Arabic) 
This means potentially much more ambiguity; we 
have to figure out which Japanese vowels shouldn't 
~The English phonemic representation uses the 
phoneme set from the online Carnegie Mellon Uni- 
versity Pronouncing Dictionary, a machine-readable 
pronunciation dictionary for North American English 
(http://w~. speech, cs. aau. edu/cgi-b in/cmudict). 
be there (deletion), but we have to figure out which 
Arabic vowels should be there (addition). 
For cases where Arabic has two potential map- 
pings for one English consonant, the ambiguity does 
not matter. Resolving that ambiguity is bonus when 
going in the backwards direction--English T, for ex- 
ample, can be safely posited for Arabic t or T with- 
out losing any information• 
4 New Model for Arabic 
Fortunately, the first two models of (Knight and 
Graehl, 1997) deal with English only, so we can re- 
use them directly for Arabic/English transliteration. 
These are P(w), the probability of a particular En- 
glish word sequence and P(elw), the probability of 
an English sound sequence given a word sequence. 
For example, P(Peter) may be 0.00035 and P(P IY 
T gRlPeter ) may be 1.0 (if Peter has only one pro- 
nunciation). 
To follow the Japanese system, we would next 
propose a new model P(qle) for generating Arabic 
phoneme sequences from English ones, and another 
model P(alq) for Arabic orthography. We would 
then attempt to find data resources for estimating 
these probabilities. This is hard, because true Ara- 
bic pronunciations are hidden and no databases are 
available for directly estimating probabilities involv- 
ing them. 
Instead, we will build only one new model, P(ale ), 
which converts English phoneme sequences directly 
into Arabic writing. ~,Ve might expect the model to 
include probabilities that look like: 
P(flF) = 1.0 
P(tlT ) = 0.7 
P(TIT ) = 0.3 
P(slS ) = 0.9 
P(zIS) -- 0.1 
P(wlAH) = 0.2 
P(nothinglAH ) = 0.4 
P(!+IAH) = 0.4 
The next problem is to estimate these numbers 
empirically from data. We did not have a large 
bilingual dictionary of names and terms for Ara- 
bic/English, so we built a small 150-word dictionary 
by hand. We looked up English word pronunciations 
in a phonetic dictionary, generating the English- 
phoneme-to-Arabic-writing training data shown in 
Figure 1. 
We applied the EM learning algorithm described 
in (Knight and Graehl, 1997) on this data, with one 
variation. They required that each English sound 
36 
((AE N T OW N IY ON) (! ' n T w n y w)) 
((AE N T AH N IY) (.' ' n T w n y)) 
((AA N W AA R) (! ' n w r)) 
((AA R M IH T IH JH) (! ' r m y t ! j)) 
((AA R N AA L D OW) (! r n i d w)) 
((AE T K IH N Z) (! ' t k y n z)) 
((K AO L V IY N OW) (k ! 1 f y n w)) 
((K AE M ER AH N) (k ! m r ! n)) 
((K AH M IY L) (k m y i)) 
((K AA R L AH) (k '. r 1 .')) 
((K AE R AH L) (k ! r w i)) 
((K EH R AH LAY N) (k ! r w 1 y n)) 
((K EH R AH L IH N) (k ! r w 1 y n)) 
((K AA R Y ER) (k ! r f r)) 
((K AE S AH L) (k ! s I)) 
((K R IH S) (k r y s)) 
((K R IH S CH AH N) (k r y s t s h n)) 
((K R IH S T AH F ER) (k r y s t w f r)) 
((K L AO D) (k 1 w d)) 
((K LAY D) (k 1 ! y d)) 
((K AA K R AH N) (k w k r ! n)) 
((K UH K) (k w k)) 
((K AO R IH G AH N) (k w r y G ! n)) 
((EH B ER HH AA R T) (! + y b r ffi h ! r d)) 
((EH D M AH N D) (! + d m w n)) 
((EH D W ER D) (! ' d w ! r d)) 
((AH LAY AH S) (! + i y ! s) 
((IH L IH Z AH BAH TH) (! + 1 y z ! b y t h)) 
Figure 1: Sample of English phoneme to Arabic writ- 
ing training data. 
5 Problems Specific to Arabic 
One problem was the production of many wrong En- 
glish phrases, all containing the sound D. For ex- 
ample, the Arabic sequence 0~ frym!n yielded 
two possible English sources, Freeman and Fried- 
man. The latter is incorrect. The problem proved 
to be that, like several vowels, an English D sound 
sometimes produces no Arabic letters. This happens 
in cases like .jl~i Edward ! 'dw!r and 03~7.~ Ray- 
mond rymwn. Inspection showed that D should only 
be dropped in word-final position, however, and not 
in the middle of a word like Friedman. 
This brings into question the entire shape of our 
P(ale ) model, which is based on a substitution of 
Arabic letters for an English sound, independent of 
that sound's context. Fortunately, we could incor- 
porate an only-drop-final-D constraint by extending 
the model's transducer format. 
The old transducer looked like this: 
S/z'~ "'" 
While tile new transducer looks like this: 
produce at least one Japanese sound. This worked 
because Japanese sound sequences are always longer 
than English ones, due to extra Japanese vowels. 
Arabic letter sequences, on the other hand, may be 
shorter than their English counterparts, so we allow 
each English sound the option of producing no Ara- 
bic letters at all. This puts an extra computational 
strain on the learning algorithm, but is otherwise 
not difficult. 
Initial results were satisfactory. The program 
learned to map English sounds onto Arabic letter 
sequences, e.g.: Nicholas onto ~r,N~" nykwl ! s and 
Williams onto .~..~ wlymz. 
We applied our three probabilistic models to pre- 
viously unseen Arabic strings and obtained the top 
n English back-transliteration for each, e.g., 
byfrly 
bykr 
!'dw!r 
=hdswn 
=hwknz 
Beverly Beverley 
Baker Picker Becker 
Edward Edouard Eduard 
Hudson Hadson Hodson 
Hawkins Huggins Huckins 
~Ve then detected several systematic problems 
with our results, which we turn to next. 
D/a  
S/z v 
Whenever D produces no letters, the transducer 
finds itself in a final state with no further transitions. 
It can consume no further English sound input, so 
it has, by definition, come to the end of the word. 
We noticed a similar effect with English vowels 
at the end of words. For example, the system sug- 
gested both Manuel and Manuela as possible sources 
for ~,SL~ ,,!nwyl. Manuela is incorrect; we elimi- 
nated this frequent error with a technique like the 
one described above. 
A third problem also concerned English vowels. 
For Arabic .~.'lzf~i !'wkt !fy., the system produced 
both Octavio and Octavia as potential sources, 
though the latter is wrong. While it is possible for 
the English vowel ~ (final in Octavia) to produce 
Arabic w in some contexts (e.g., .~..%~ rwjr/Roger), 
it cannot do so at the end of a word. Eli and AA have 
the same property. Furthermore, none of those three 
vowels can produce the letter y when in word-final 
position. Other vowels like IY may of course do so. 
We pursued a general solution, replacing each in- 
37 
stance of an English vowel in our training data with e II 
one of three symbols, depending on its position in AA 
the word. For example, an AH in word-initial po- AA-S 
sit!on was replaced by AH-S; word-final AH was re- ,, 
placed by AH-F; word-medial was htI. This increases AE AE-S 
our vowel sound inventory by a factor of three, and AH " 
even though AH might be pronounced the same in 
any position, the three distinct AH- symbols can ac- 
quire different mappings to Arabic. In the case of 
AH, learning revealed: ,, 
P(wIAH ) - 0.288 
P(nothingl~i ) = 0.209 
P( IAH) = 0.260 
P(ylAH) = 0.173 
P(!IAH-F) -- 0.765 
P(&IAH-F) : 0.176 
P(=hIAH-F ) : 0.059 
P(!+IAH-S) = 0.5 
P(!yIAH-S) = 0.5 
P(! '\[AH-S) -- 0.25 
We can see that word-final AH can never be 
dropped. We can also see that word-initial AH can be 
dropped; this goes beyond the constraints we origi- 
nally envisioned. Figure 2 shows the complete table 
of sound-letter mappings. 
We introduced just enough context in our sound 
mappings to achieve reasonable results. We could, 
of course, introduce left and right context for every 
sound symbol, but this would fragment our data; it 
is difficult to learn general rules from a handful of 
examples. Linguistic guidance helped us overcome 
these problems. 
6 EXAMPLE 
Here we show the internal workings of the system 
through an example. Suppose we observe the Ara- 
bic string br!nstn. First, we pass it though the 
P(a\[e) model from Figure 2, producing the network 
of possible English sound sequences shown in Fig- 
ure 3. Each sequence ei could produce (or "explain") 
br!nstn and is scored with P(br!nstn\[ ei). For ex- 
ample, P(br\[nstn\[BRAENSTN) = 0.54. 
AH-F 
AH-S 
AO 
AY 
AY-F 
AY-S 
B 
CH 
D 
EH 
EH-S 
ER 
EY 
EY-F 
EY-S 
F 
G 
HH 
IH 
IH-S 
IY 
IY-F 
IY-S 
JH 
K 
L 
M 
N 
NG 
OW 
OW-F 
OW-S 
P 
R 
S 
SH 
T 
TH 
UH 
UW 
UW-F 
V 
W 
Y 
Z 
a P(ale) a P(ale) a P(ale) 
! 0.652 
!' 0.625 
! 0.125 
w 0.217 
!'w 0.125 
* 0.131 
!'H 0.125 
! 0.889 * 0.III 
! ' 0.889 ! 0.III 
w 0.288 * 0.269 l 0.269 
y 0.173 
! 0.765 & 0.176 =h 0.059 
!+ 0.5 !y 0.25 ! ' 0.25 
w 0.8 y 0.I ! 0.I 
y 0.8 !y 0.2 
y 1.0 
!+ 1.0 
b 1.0 
x 0.5 tsh 0.5 
d 0.968 * 0.032 
• 0.601 y 0.25 ! 0.1 
h 0.049 
! ' 0.667 !+ 0.167 !+y 0.167 
r 0.684 yr 0.182 wr 0.089 
! +r 0.045 
• 0.Iii y 0.444 
!@y 0.III 
~y 0.639 
!' 0.5 
! 0.333 
y 0.361 
! 0.5 
f 1.0 
G 0.833 k 0.167 
=h 0.885 0.113 
y 0.868 
r 0.026 
! 0.375 
!' 0.125 
* 0.079 
!+ 0.25 
!+y 0.125 
* 0.064 y 0.909 
y 1.0 
! +y 1.0 
j 1.0 
k 0.981 +q 0.019 
i 1.0 
ww 0.286 
! 0.026 
!y 0.125 
h 0.027 
m 1.0 
n 1.0 
nG 1.0 
e 0.714 
e 1.0 
! 'w 1.0 
b 1.0 
r 0.98 
s 0.913 
Ty 0.333 
t 0.682 
th 1.0 
y 0.02 
z 0.087 
shy 0.333 sh 0.333 
T 0.273 d 0.045 
w 1.0 
w 1.0 
w 1.0 
f 1.0 
w 0.767 w! 0.121 
y 1.0 
z 0.75 s 0.25 
fy 0.111 
Figure 2: English sounds (in capitals) with prob- 
abilistic mappings to Arabic sound sequences, as 
learned by estimation-maximization. 
38 
Figure 3: English sound sequences corresponding to Arabic br !nsrn. Each path through the lattice repre- 
sents one possible sequence. 
Arabic input: b r ! n s t n 
Top 20 English pronunciations P(a\[e) 
B R AE N S T N 0.541074 
P R AE N S T N 0.541074 
B R AH-F N S T N 0.465519 
P R AH-F N S T N 0.465519 
B R AA N S T N 0.397275 
P R AA N S T N 0.397275 
B FAt AE N S T N 0.377096 
PER AE N S T N 0.377096 
P R AE N S EH T N 0.324645 
B R AE N EH S T N 0.324645 
B EH R AE N S T N 0.324645 
P R AE EH N S T N 0.324645 
P R AE N S T EH N 0.324645 
P EH R AE N S T N 0.324645 
B R AE N S T N EH 0.324645 
B R EH AE N S T N 0.324645 
P R AE N EN S T N 0.324645 
P R AE N S T N EH 0.324645 
P R EH AE N S T N 0.324645 
EH B R AE N S T N 0.324645 
Next, we pass this network through the P(e\[w) 
model to produce a new network of English phrases. 
Finally, we re-score this network with the P(w) 
model. This marks a preference for common En- 
glish words/names over uncommon ones. Here are 
the top n sequences at each stage: 
Top 20 word sequences P(elw) • P(ale) 
BRANN STEN 0.324645 
BRAN STEN 0.324645 
BRONN STEN 0.238365 
PUR ANSE TEN 0.226257 
PUR ANSE TENN 0.226257 
PUR ANNE STEN 0.226257 
PERE ANSE TEN 0.226257 
PUR ANN STEN 0.226257 
PER ANSE TEN 0.226257 
PERE ANSE TENN 0.226257 
PER ANSE TENN 0.226257 
PERE ANNE STEN 0.226257 
PER ANNE STEN 0.226257 
PERE ANN STEN 0.226257 
PUR AN STEN 0.226257 
PER ANN STEN 0.226257 
PF~E AN STEN 0.226257 
PERE AHN STEN 0.226257 
PUR AHN STEN 0.226257 
PErt AN STEN 0.226257 
R~scored P(w) • P(e\[w) • P(ale ) 
word sequences 
BRONSTON 8.63004e-07 
BRONSTEIN 7.29864e-08 
BRAUNSTEIN 1.11942e-08 
39 
7 Results and Discussion 
We supplied a list of 2800 test names in Arabic to 
our program and received translations for 900 of 
them. Those not translated were frequently not for- 
eign names at all, so the program is right to fail in 
many such cases. Sample results are shown in Fig- 
ure 4. 
The program offers many good translations but 
still makes errors of omission and commission. Some 
of these errors show evidence of lexical or ortho- 
graphic influence or of interference from other lan- 
guages (such as French). 
English G is incorrectly produced from its voice- 
less counterpart in Arabic, k. For example, d~l..p" 
krys comes out correctly as Chris and Kr/s but 
also, incorrectly, as Grace. The source of the G-k 
correspondence in the training data is the English 
name AE L AH G Z AE N D ER Alexander, which is 
.~ "a.z....Q1 !lksndr in our training corpus. A voiced 
fricative G is available in Arabic, which in other con- 
texts corresponds to the English voiced stop G, al- 
though it, too, is only an approximation. It appears 
that orthographic English X is perceived to corre- 
spond to Arabic ks, perhaps due partly to French 
influence. Another possible influence is the existing 
Arabic name 1~I ! skndr (which has k), from the 
same Greek source as the English name Alezander, 
but with metathesis of k and s. 
Sometimes an Arabic version of a foreign name 
is not strictly a transliteration, but rather a transla- 
tion or lexicalized borrowing which has been adapted 
to Arabic phonological patterns. For example, the 
name Edward is found in our data as a.~l.~a/! ' dw!rd, 
.jl~.~' 'dw!r, and.~l~! !+dw!r. The last version, an 
Arabicization of the original Anglo-Saxon name, is 
pronounced Idwar. The approach taken here is flex- 
ible enought to find such matches. 
Allowing the English sound D to have a zero match 
word-finally (also a possible French influence) proves 
to be too strong an assumption, leading to matches 
such as: !' lfr Oliver Alfred. "A revised rule would 
allow the D to drop word-finally only when immedi- 
ately preceded by another consonant (which conso- 
nant could further be limited to a sonorant). 
Another anomaly which is the source of error is 
the mapping of English CH to Arabic x, which carries 
equal weight (0.5) to the mapping of Clt to Arabic 
tsh (0.5). This derives from the name Koch, which 
in Arabic is ~-j.C'kwx, as in the German pronuncia- 
tion of the name, as opposed to the English pronun- 
ciation. This kind of language interference can be 
minimized by enlarging the training data set. 
40 
!'bwrzq 
!'by 
!'byD 
!'d!mz 
!'drys 
!'dw!r 
!'dw!rd 
!~dwnys 
!'dyb 
!'f!y~ 
!'fr!H 
!'fyn!sh 
!'krm 
!'l!n 
!'lbrt 
!'lbrty 
!'ibyr 
!'If!rw 
!'ifr 
!'lksndr 
!'in 
!~lys 
!'lyswn 
!~mjd 
!'mnwn 
!'mrz!q~ 
!'mst 
!'mWS 
!'mykw 
!'my1 
!'mym& 
!'myn 
!'myn& 
!'myr 
!'n! 
!'nGz 
!'nTw!n 
!'nTwInyt 
!'nTwn 
!'nTwny 
!'nTwny! 
!'nTwnyw 
!'ndrw 
!'ndry=h 
!'ndryyf 
ABBEY ABBY ABBIE 
ADAMS ADDAMS 
EDRIS 
EDWARD EDOUARD EDUARD 
EDWARD EDOUAKD EDUARD 
AVERA 
ALAN ALLEN ALLAN 
ALBERT ALPERT ELBERT 
ALBERTI ALBERTY 
ALPER 
ALVARO ALFARO ALVERO 
OLIVER OLIVER ALFRED 
ALEXANDER ALEXANDER ALEXANDRE 
ALAN ALLEN ALLAN 
ELLIS ALICE LAS 
ALLISON ALISON ELLISON 
AMOS AMOSS 
AMIC0 AMERCO 
EMIL EMILE EMAIL 
AMMAN AMIN AMMEEN 
AMER AMIR AMOR 
ANA ANNA ANA 
INIGUEZ 
ANTOINE ANTOINE 
ANTOINETTE ANTOINETTE 
ANTON ANT00N ANTOINE 
ANTONY ANTONI ANTONE 
ANTONIA 
ANTONIO ANTONIU 
ANDREW ANDREU 
ANDREA ANDREA ANDRIA 
Figure 4: Sample program results for English trans- 
lations of names written in Arabic. 
English orthography appears to have skewed the 
training data in the English name Frederick, pro- 
nounced F R F_hi D R IH K. In Arabic we have A.~) 
frdyrk as well as frydryk, frdyryk and frdryk 
for this name. The English spelling has three vow- 
els, but the English phonemic representation lacks 
the middle vowel. But some Arabic versions have 
a (long) "vowel" (y) in the middle vowel position, 
leading in the training data to the incorrect map- 
ping English R to Arabic y. This results in incorrect 
translations, such as Racine for ysyn. 
As might be expected when the sound system of 
one language is being re-interpreted in another, Ara- 
bic transliteration is not always attuned to the sub- 
tleties of English phonotactic variation, especially 
when the variants are not reflected in English or- 
thography. An example is the assimilation in voic- 
ing of English fricatives to an immediately preceding 
stop consonant. In James, pronounced JH EY H Z, 
the final consonant is a voiced Z although it is spelled 
with the voiceless variant, s. In this case, Arabic 
follows the English orthography rather than pro- 
nunciation, transliterating it O-~T jyms. Similarly, 
Horowitz is pronounced HH A0 R 0W IH T $ in En- 
glish, with a final devoiced $ rather than the voiced 
variant z present in the spelling, whereas the Arabic 
transliteration follows the English spelling, ff~ .%~ja~ 
=hwrwwytz. The present version of the program ap- 
plies these variant correspondences indiscriminately, 
such that ~3.~.I... s!ymwn is translated as Simon or 
Zyman. Separating out these correspondences ac- 
cording to their positions in the word, as was done 
with the vowels, would help to rectify this, by re- 
ducing the probability of an S--z correspondence in 
less likely positions (e.g., initial position). 
Some Arabic transliterations carry the imprint 
of English spelling even when it departs even far- 
ther from the pronunciation. For example, i,~I.~ 
Gr!=h!m is an Arabic transliteration for the En- 
glish name Graham, pronounced G R AE H. (an al- 
ternative is the Arabic i'~ Gr!m). These mappings 
were not found by the program (even though they 
might be readily evident to a human). This kind 
of spelling-transliteration lies outside of the phone- 
mic correspondences to Arabic orthography that the 
program has learned. 
Vowels are still a problem, even when they are dis- 
tinguished by their position in the word. In the test 
cases given in the Introduction, (answers are Mike 
McCurrg, OPEC, and lnternet Explorer), the qual- 
ity of the Arabic vowels, when present, matches the 
English vowels fairly well. However, as can be seen 
from names like Frederick, the decision as to whether 
or not to insert vowel is arbitrary and somewhat de- 
pendent on English orthography, which influences 
the quality as well as position of the Arabic vowel. 
Medial English AIt, for example, is normally ! (alif 
but can also be found in Arabic as t~ or y (e.g, En- 
glish Jeremy, pronounced JH EH R AH H IY is writ- 
ten in Arabic as jyrymy). This results in incorrect 
translations, such as Amman for Arabic ! 'myn. 
In this initial model, English vowel stress was not 
represented. Because long vowels in Arabic are usu- 
ally stressed, one might expect that English stressed 
vowels would be equated with Arabic long vowels for 
purposes of transliteration. However, our data sug- 
gest that English stress does not have a strong corre- 
lation with Arabic long vowel placement in translit- 
erated names. For example, Kevin mapped to 
kyfyn and ".~d~kfyn, but not .~(.. kyfn. If stress 
were a factor here and were interpreted as a long 
vowel, -~(.. kyfn would be predicted as a preferred 
transliteration based on the phonemic representa- 
tion of Kevin as K EH1 V IH N (where "1" indicates 
primary stress). Similarly,.~." fyktwr and .~,i 
fktt~r were found for Victor but not the expected 
fyktr. "~.~kynth is found, but so are 
kynyth, and "~.~knyth. In syllable-final position at 
least, it appears that stress does not outweigh other 
factors in Arabic vowel placement. However, the re- 
lation of English stress to Arabic vowel placement 
in other positions might be used to rule out unlikely 
translations (such as Camille with final stress for 
Arabic 0.,~Sk!ml) and deserves further study. 
All of these observations point to places where the 
system can be improved. A larger training data set, 
more selected contextual mappings, and refinement 
of linguistic rules are all potential ways to capture 
these improvements within the finite-state frame- 
work, and we hope to study them in the near future. 

References 
Mansur Arbabi, Scott M. Fischthal, Vincent C. 
Cheng, and Elizabeth Bart. 1994. Algorithms 
for Arabic name transliteration. IBM Journal of 
Research and Development, 38(2):183-193. 
Kevin Knight and Jonathan Graehl. 1997. Ma- 
chine transliteration. In Proceedings of the 35th 
Annual Meeting of the Association for Computa- 
tional Linguistics, pages 128-135. Morgan Kauf- 
mann. 
Fernando C. N. Pereira and Michael Pdley. 1997. 
Speech recognition by composition of weighted 
finite automata. In E. Roche and Y. Schabes, 
editors, Finite-State Language Processing, pages 
431-453. MIT Press. 
