Transliteration of Proper Names in Cross-Lingual Information Retrieval
Paola Virga
Johns Hopkins University
3400 North Charles Street
Baltimore, MD 21218, USA
paola@jhu.edu
Sanjeev Khudanpur
Johns Hopkins University
3400 North Charles Street
Baltimore, MD 21218, USA
khudanpur@jhu.edu
Abstract
We address the problem of transliterating
English names using Chinese orthogra-
phy in support of cross-lingual speech and
text processing applications. We demon-
strate the application of statistical ma-
chine translation techniques to “translate”
the phonemic representation of an En-
glish name, obtained by using an auto-
matic text-to-speech system, to a sequence
of initials and finals, commonly used sub-
word units of pronunciation for Chinese.
We then use another statistical translation
model to map the initial/final sequence
to Chinese characters. We also present
an evaluation of this module in retrieval
of Mandarin spoken documents from the
TDT corpus using English text queries.
1 Introduction
Translation of proper names is generally recognized
as a significant problem in many multi-lingual text
and speech processing applications. Even when
hand-crafted translation lexicons used for machine
translation (MT) and cross-lingual information retrieval
(CLIR) provide good coverage of the
words encountered in the text, a large portion
of the tokens not covered by the lexicon are proper
names and domain-specific terminology (cf., e.g.,
Meng et al (2000)). This lack of translations adversely
affects performance. For CLIR applications
in particular, proper names and technical terms are
especially important, as they carry the most distinc-
tive information in a query as corroborated by their
relatively low document frequency. Finally, in in-
teractive IR systems where users provide very short
queries (e.g. 2-5 words), their importance grows
even further.
Unlike specialized terminology, however, proper
names are amenable to a speech-inspired translation
approach. When writing a foreign name in one's own
language, one tries to preserve the way it sounds,
i.e., one uses an orthographic representation which,
when “read aloud” by a speaker of one's language,
sounds as much as possible like the name spoken by a
speaker of the foreign language. This process is
referred to as transliteration. Therefore, if a mechanism
were available to render, say, an English name
in its phonemic form, and another mechanism were
available to convert this phonemic string into the or-
thography of, say, Chinese, then one would have
a mechanism for transliterating English names us-
ing Chinese characters. The first step has been ad-
dressed extensively, for other obvious reasons, in the
automatic speech synthesis literature. This paper de-
scribes a statistical approach for the second step.
Several techniques have been proposed in the
recent past for name transliteration. Rather than
providing a comprehensive survey we highlight a
few representative approaches here. Finite state
transducers that implement transformation rules
for back-transliteration from Japanese to English
have been described by Knight and Graehl (1997),
and extended to Arabic by Glover-Stalls and
Knight (1998). In both cases, the goal is to recognize
words in Japanese or Arabic text which happen to be
transliterations of English names. If the
orthography of a language is strongly phonetic, as
is the case for Korean, then one may use relatively
simple hidden Markov models to transform English
pronunciations, as shown by Jung et al (2000). The
work closest to our application scenario, and the one
with which we make several direct comparisons,
is that of Meng et al (2001). In their work,
a set of hand-crafted transformations for locally editing
the phonemic spelling of an English word to conform
to rules of Mandarin syllabification is used
to seed a transformation-based learning algorithm.
The algorithm examines some data and learns the
proper sequence of application of the transformations
to convert an English phoneme sequence to a
Mandarin syllable sequence. Our paper describes a
data-driven counterpart to this technique, in which
a cascade of two source-channel translation models
is used to go from English names to their Chinese
transliteration. Thus even the initial requirement of
creating candidate transformation rules, which may
require knowledge of the phonology of the target
language, is eliminated.
Figure 1: Four steps in English-to-Chinese transliteration of names.
We also investigate incorporation of this translit-
eration system in a cross-lingual spoken document
retrieval application, in which English text queries
are used to index and retrieve Mandarin audio from
the TDT corpus.
2 Translation System Description
We break down the transliteration process into vari-
ous steps as depicted in Figure 1.
1. Conversion of an English name into a phonemic
representation using the Festival¹ speech
synthesis system.
2. Translation of the English phoneme sequence
into a sequence of generalized initials and fi-
nals or GIFs — commonly used sub-syllabic
units for expressing pronunciations of Chinese
characters.
3. Transformation of the GIF sequence into a se-
quence of pin-yin symbols without tone.
4. Translation of the pin-yin sequence to a charac-
ter sequence.
Steps 1 and 3 are deterministic transformations,
while Steps 2 and 4 are accomplished by statistical
means.
The IBM source-channel model for statistical machine
translation (Brown et al., 1993) plays a central
role in our system. We therefore describe it very
briefly here for completeness. In this model, a $J$-word
foreign language sentence $\mathbf{f} = f_1 f_2 \cdots f_J$
is modeled as the output of a “noisy channel” whose
input is its correct $I$-word English translation
$\mathbf{e} = e_1 e_2 \cdots e_I$, and having observed the channel
output $\mathbf{f}$, one seeks a posteriori the most likely English
sentence
$$\hat{\mathbf{e}} = \arg\max_{\mathbf{e}} P(\mathbf{e} \mid \mathbf{f}) = \arg\max_{\mathbf{e}} P(\mathbf{f} \mid \mathbf{e}) \, P(\mathbf{e}).$$
The translation model $P(\mathbf{f} \mid \mathbf{e})$ is estimated from
a paired corpus of foreign-language sentences and
their English translations, and the language model
$P(\mathbf{e})$ is trained from English text. Software tools
are available both for training models² and for
decoding³, the task of determining the most likely
translation $\hat{\mathbf{e}}$.
¹ http://www.speech.cs.cmu.edu/festival
Figure 2: Schematic of an English-to-Chinese name transliteration system.
Since we seek Chinese names which are transliterations
of a given English name, the notion of
words in a sentence in the IBM model above is
replaced with phonemes in a word. The roles of
English and Chinese are also reversed. Therefore,
$\mathbf{f} = f_1 f_2 \cdots f_J$ represents a sequence of English
phonemes, and $\mathbf{e} = e_1 e_2 \cdots e_I$, for instance, a
sequence of GIF symbols in Step 2 described above.
The overall architecture of the proposed transliteration
system is illustrated in Figure 2.
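As a minimal sketch of the source-channel decision rule applied to phonemes and GIF symbols, the following Python fragment scores a few candidate GIF sequences by log P(f|e) + log P(e) and picks the best. The probability tables, the unigram language model, the explicit candidate list and the one-to-one monotone alignment are all toy assumptions made for illustration; the system described here relies on IBM Model 4 alignments, a trigram language model and the search performed by the ReWrite decoder.

```python
import math

# Toy channel model P(f | e): probability of an English phoneme f given a
# GIF symbol e. All numbers are invented for illustration.
CHANNEL = {
    ("k", "k"): 0.9, ("k", "g"): 0.1,
    ("iy", "i"): 0.8, ("iy", "ei"): 0.2,
}

# Toy unigram language model P(e) over GIF symbols (an assumption; the
# paper uses a trigram model).
LM = {"k": 0.5, "g": 0.2, "i": 0.2, "ei": 0.1}


def log_score(phonemes, gifs):
    """Return log P(f|e) + log P(e) under a monotone one-to-one alignment."""
    if len(phonemes) != len(gifs):
        return -math.inf
    total = 0.0
    for f, e in zip(phonemes, gifs):
        p = CHANNEL.get((f, e), 0.0) * LM.get(e, 0.0)
        if p == 0.0:
            return -math.inf
        total += math.log(p)
    return total


def decode(phonemes, candidates):
    """Pick the candidate GIF sequence with the highest combined score."""
    return max(candidates, key=lambda gifs: log_score(phonemes, gifs))


best = decode(["k", "iy"], [["k", "i"], ["g", "i"], ["k", "ei"]])
```

A real decoder searches over all GIF sequences rather than a fixed list, and must also handle insertions and deletions, which this sketch leaves out.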
2.1 Translation Model Training
We have available from Meng et al (2000) a small
list of 3875 English names and their Chinese
transliterations. A pin-yin rendering of the Chinese
transliteration is also provided. We use the Festival
text-to-speech system to obtain a phonemic
pronunciation of each English name. We also replace
all pin-yin symbols by their pronunciations, which
are described using an inventory of generalized
initials and finals. The pronunciation table for this
purpose is obtained from an elementary Mandarin
textbook (Practical Chinese Reader, 1981). The net
result is a corpus of 3875 pairs of “sentences” of the
kind depicted in the second and third lines of Figure
1. The vocabulary of the English side of this parallel
corpus is 43 phonemes, and the Chinese side is 58
(21 initials and 37 finals). Note, however, that only
409 of the 21 × 37 = 777 possible initial-final combinations
constitute legal pin-yin symbols.
² http://www-i6.informatik.rwth-aachen.de/~och/software/GIZA++.html
³ http://www.isi.edu/licensed-sw/rewrite-decoder
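To illustrate how a toneless pin-yin rendering can be rewritten as a sequence of initials and finals, here is a minimal Python sketch. It assumes the 21 initials of standard Hanyu Pinyin and treats whatever follows the initial as the final; the generalized inventory taken from the textbook table may differ in detail (for example in how the vowel after z, c, s is written), so this is a simplification, not the table used in the experiments. The function names are hypothetical.

```python
# The 21 initials of standard Hanyu Pinyin. Two-letter initials come first
# so that "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

# 21 initials times 37 finals gives 777 conceivable pairs; the text notes
# that only 409 of them are legal pin-yin symbols.
N_COMBINATIONS = len(INITIALS) * 37


def split_syllable(pinyin):
    """Split a toneless pin-yin syllable into (initial, final)."""
    for ini in INITIALS:
        if pinyin.startswith(ini) and len(pinyin) > len(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin  # zero-initial syllable such as "an"


def to_gif_sequence(syllables):
    """Flatten a list of syllables into one initial/final sequence."""
    seq = []
    for syl in syllables:
        ini, fin = split_syllable(syl)
        if ini:
            seq.append(ini)
        seq.append(fin)
    return seq


gifs = to_gif_sequence(["ke", "lin", "dun"])
```

Here `gifs` holds the six symbols k, e, l, in, d, un, which is the kind of "sentence" that forms the Chinese side of the first parallel corpus.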
A second corpus of 3875 “sentence” pairs is de-
rived corresponding to the fourth and fifth lines of
Figure 1, this time to train a statistical model to
translate pin-yin sequences to Chinese characters.
The vocabulary of the pin-yin side of this corpus
is 282 and that of the character side is about 680.
These, of course, are much smaller than the full
inventories of Chinese pin-yin symbols and characters. We
note that certain characters are preferentially used
in transliteration over others, and the resulting
frequency of character usage is not the same as in
unrestricted Chinese text. However, there is no distinct
set of characters used exclusively for transliteration.
For purposes of comparison with the translitera-
tion accuracy reported by Meng et al (2001), we di-
vide this list into 2233 training name-pairs and 1541
test name-pairs. For subsequent CLIR experiments,
we create a larger training set of 3625 name-pairs,
leaving only 250 name-pairs for intrinsic testing of
transliteration performance. The actual training of
all translation models proceeds according to a stan-
dard recipe recommended in GIZA++, namely 5 it-
erations of Model 1, followed by 5 of Model 2, 10
HMM-iterations and 10 iterations of Model 4.
2.2 Language Model Training
The GIF language model required for translating En-
glish phoneme sequences to GIF sequences is esti-
mated from the training portion of the 3875 Chinese
names. A trigram language model on the GIF vo-
cabulary is estimated with the CMU toolkit, using
Good-Turing smoothing and Katz back-off. Note
that due to the smoothing, this language model does
not necessarily assign zero probability to an ille-
gal GIF sequence, e.g., one containing two consec-
utive initials. This causes the first translation sys-
tem to sometimes, though very rarely, produce GIF
sequences which do not correspond to any pin-yin
sequence. We make an ad hoc correction of such se-
quences when mapping a GIF sequence to pin-yin,
which is otherwise trivial for all legal sequences of
initials and finals. Specifically, a final e or i or a is
tried, in that order, between consecutive initials until
a legitimate sequence of pin-yin symbols obtains.
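The ad hoc correction just described can be sketched in a few lines of Python. The set of legal syllables below is a tiny toy subset chosen for the example, and the function name is hypothetical; a real implementation would consult the full table of legal pin-yin symbols. If none of e, i, a yields a legal syllable, this sketch simply leaves the sequence unchanged at that position, which is an assumption on our part.

```python
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s"}

# Toy subset of legal toneless pin-yin syllables (assumption for the demo).
LEGAL_SYLLABLES = {"ke", "lin", "si", "ta", "la"}


def repair_gif_sequence(gifs, legal=LEGAL_SYLLABLES):
    """Insert a final between two consecutive initials, trying e, i, a."""
    repaired = []
    for k, sym in enumerate(gifs):
        repaired.append(sym)
        nxt = gifs[k + 1] if k + 1 < len(gifs) else None
        if sym in INITIALS and nxt in INITIALS:
            # Two initials in a row: try each candidate final in order.
            for final in ("e", "i", "a"):
                if sym + final in legal:
                    repaired.append(final)
                    break
    return repaired


fixed = repair_gif_sequence(["k", "l", "in"])
```

With the toy table, the illegal sequence k, l, in becomes k, e, l, in, which maps to the pin-yin symbols "ke lin".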
The language model required for translating pin-
yin sequences to Chinese characters is relatively
straightforward. A character trigram model with
Good-Turing discounting and Katz back-off is es-
timated from the list of transliterated names.
2.3 Decoding Issues
We use the ReWrite decoder provided by ISI, along
with the two translation models and their corresponding
language models, trained on either 2233
or 3625 name-pairs as described above, to transliterate
the English names in the corresponding test
sets of 1541 or 250 name-pairs, as follows.
1. An English name is first converted to a
phoneme sequence via Festival.
2. The phoneme sequence is translated into a
GIF sequence using the first translation model
described above.
3. The translation output is corrected if necessary
to create a legitimate pin-yin sequence.
4. The pin-yin sequence is translated into a se-
quence of Chinese characters using a second
translation model, also described above.
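The four decoding steps above compose into a simple pipeline, sketched below in Python. Every stage is passed in as a function, and the dictionary-based stand-ins are toy assumptions that exist only to make the example runnable; in the actual system Step 1 is the Festival front end and Steps 2 and 4 are calls to the ReWrite decoder with the respective models.

```python
def transliterate(name, to_phonemes, phonemes_to_gifs,
                  gifs_to_pinyin, pinyin_to_chars):
    """Chain the four transliteration steps and return a character string."""
    phonemes = to_phonemes(name)         # Step 1: text-to-speech front end
    gifs = phonemes_to_gifs(phonemes)    # Step 2: first translation model
    pinyin = gifs_to_pinyin(gifs)        # Step 3: deterministic, with repair
    chars = pinyin_to_chars(pinyin)      # Step 4: second translation model
    return "".join(chars)


# Toy stand-ins (assumptions): each stage is a dictionary lookup.
PHONEMES = {"Lee": ["l", "iy"]}
GIFS = {("l", "iy"): ["l", "i"]}
PINYIN = {("l", "i"): ["li"]}
CHARS = {("li",): ["李"]}

result = transliterate(
    "Lee",
    to_phonemes=lambda n: PHONEMES[n],
    phonemes_to_gifs=lambda p: GIFS[tuple(p)],
    gifs_to_pinyin=lambda g: PINYIN[tuple(g)],
    pinyin_to_chars=lambda s: CHARS[tuple(s)],
)
```

Keeping the stages as separate callables mirrors the cascade structure: either statistical model can be retrained or replaced without touching the deterministic steps.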
A small but important manual setting in the ReWrite
decoder is a list of zero fertility words. In the IBM
model described earlier, these are the words $e_i$ which
may be “deleted” by the noisy channel when transforming
$\mathbf{e}$ into $\mathbf{f}$. For the decoder, these are therefore
the words which may be optionally inserted in
$\hat{\mathbf{e}}$ even when there is no word in $\mathbf{f}$ of which they are
considered a direct translation. For the usual case of
Chinese to English translation, these would typically
be articles and other function words which may not
be prevalent in the foreign language but are frequent in
English.
For the phoneme-to-GIF translation model, the
“words” which need to be inserted in this manner
are syllabic nuclei! This is because Mandarin does
not permit complex consonant clusters in a way that
is quite prevalent in English. This linguistic knowl-
edge, however, need not be imparted by hand in the
IBM model. One can, indeed, derive such a list from
the trained models by simply reading off the list of
symbols which have zero fertility with high probability.
This list, in our case, is {-i, e, u, o, r, ü,
ou, c, iu, ie}.
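Reading the zero-fertility list off a trained model amounts to a simple filter, sketched below in Python. The in-memory table format, the probabilities and the threshold of 0.5 are all hypothetical; the text only says that symbols with zero fertility "with high probability" are selected, and does not specify how the fertility table is stored.

```python
def zero_fertility_symbols(fertility, threshold=0.5):
    """Return symbols whose probability of fertility 0 is at least `threshold`."""
    return sorted(sym for sym, probs in fertility.items()
                  if probs.get(0, 0.0) >= threshold)


# Hypothetical fertility table: symbol -> {fertility: probability}.
FERTILITY = {
    "e": {0: 0.70, 1: 0.30},
    "k": {0: 0.05, 1: 0.90, 2: 0.05},
    "u": {0: 0.60, 1: 0.40},
}

insertable = zero_fertility_symbols(FERTILITY)
```

In this toy table the vowels e and u qualify while the consonant k does not, in line with the observation that the symbols to be inserted are mostly syllabic nuclei.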
The second translation system, for converting pin-
yin sequences to character sequences, has a one-to-
one mapping between symbols and therefore has no
words with zero fertility.
2.4 Intrinsic Evaluation of Transliteration
We evaluate the efficacy of our transliteration at two
levels. For comparison with the closely comparable
set-up of Meng et al (2001), we measure the accuracy
of the pin-yin output produced by our system
after Step 3 in Section 2.3. The results are shown in
Table 1, where the pin-yin error rate is the edit distance
between the pin-yin representation of the
correct transliteration and the pin-yin sequence output
by the system.
Translation   Training   Test   Pin-yin   Char
System        Size       Size   Errors    Errors
Meng et al    2233       1541   52.5%     N/A
Small MT      2233       1541   50.8%     57.4%
Big MT        3625       250    49.1%     57.4%
Table 1: Pin-yin and character error rates in automatic transliteration.
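The pin-yin error rate defined above is an edit distance over pin-yin symbols. A minimal Python sketch follows; normalizing the total number of edits by the total length of the reference sequences is our assumption about how the percentage is formed, since the text does not spell out the denominator.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(pairs):
    """Total edit distance divided by total reference length."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_len = sum(len(ref) for ref, _ in pairs)
    return edits / ref_len


pairs = [
    (["ke", "lin", "dun"], ["ke", "li", "dun"]),  # one substitution
    (["bu", "shi"], ["bu", "shi"]),               # exact match
]
rate = error_rate(pairs)
```

The same function gives a character error rate when the sequences are lists of Chinese characters instead of pin-yin symbols.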
Note that the pin-yin error performance of our
fully statistical method is quite competitive with pre-
vious results. We further note that increasing the
training data results in further reduction of the syl-
lable error rate. We concede that this performance,
while comparable to other systems, is not satisfac-
tory and merits further investigation.
We also evaluate the efficacy of our second trans-
lation system which maps the pin-yin sequence pro-
duced by the previous stages to a sequence of Chi-
nese characters, and obtain character error rates of
12.6%. Thus every correctly recognized pin-yin
symbol has a chance of being transformed with
some error, resulting in higher character error rate
than the pin-yin error rate. Note that while signifi-
cantly lower error rates have been reported for con-
verting pin-yin to characters in generic Chinese text,
ours is a highly specialized subset of transliterated
foreign names, where the choice between several
characters sharing the same pin-yin symbol is some-
what arbitrary.
3 Spoken Document Retrieval System
Several multi-lingual speech and text applications
require some form of name transliteration, cross-
lingual spoken document retrieval being a proto-
typical example. We build upon the experimen-
tal infrastructure developed at the 2000 Johns Hop-
kins Summer Workshop (Meng et al., 2000) where
considerable work was done towards indexing and
retrieving Mandarin audio to match English text
queries. Specifically, we find that in a large number
of queries used in those experiments, English proper
names are not available in the translation lexicon,
and are subsequently ignored during retrieval. We
use the technique described above to transliterate all
such names into Chinese characters and observe the
effect on retrieval performance.
The TDT-2 corpus, which we use for our experi-
ments, contains 2265 audio clips of Mandarin news
stories, along with several thousand contemporane-
ously published Chinese text articles, and English
text and audio broadcasts. The articles tend to be
several hundred to a few thousand words long, while
the audio clips tend to be two minutes or less on average.
The purpose of the corpus is to facilitate research
in topic detection and tracking, and exhaustive
relevance judgments are provided for several topics,
i.e., for each of at least 17 topics, every English and
Chinese article and news clip has been examined by
a human assessor and determined to be either on-
or off-topic. We randomly select an English article
on each of the 17 topics as a query, and wish
to retrieve all the Mandarin audio clips on the same
topic without retrieving any that are off-topic. For
mitigating the variability due to query selection, we
choose up to 12 different English articles for each of
the 17 topics and average retrieval performance over
this selection before reporting any results. We use
the query term-selection and translation technique
described by Meng et al (2000) to convert the En-
glish document to Chinese, the only augmentation
being the transliterated names — there are roughly
2000 tokens in the queries which are not translat-
able, and almost all of them are proper names. We
report IR performance with and without the name-
transliteration.
We use a different information retrieval system
from the one used in the 2000 Workshop (Meng et
al., 2000) to perform the retrieval task. A brief de-
scription of the system is therefore in order.
3.1 The HAIRCUT System
The Hopkins Automated Information Retriever for
Combing Unstructured Text (HAIRCUT) is a re-
search retrieval system developed at the Johns Hop-
kins University Applied Physics Laboratory. The
system was developed to investigate knowledge-
light methods for linguistic processing in text re-
trieval. HAIRCUT uses a statistical language model
of retrieval such as the one explored by Hiem-
stra (2001). The model ranks documents according
to the probability that the terms in a query are gen-
erated by a document. Various smoothing methods
have been proposed to combine the contributions for
each term based on the document model and also a
generic model of the language. Many have found
that a simple mixture model using document term
frequencies for the former, and occurrence statistics
from a large corpus for the latter, works quite well.
McNamee and Mayfield (2001) have shown using
HAIRCUT that overlapping character n-grams are
effective for retrieval in non-Asian languages (e.g.,
using n=6) and that translingual retrieval between
closely related languages is quite feasible even without
translation resources of any kind (McNamee and
Mayfield, 2002).
                     CLIR mean Average Precision
System               No NE             Automatic NE      LDC NE
                     Transliteration   Transliteration   Look-Up
Meng et al (2001)    0.514             0.522             N/A
HAIRCUT              0.501             0.515             0.506
Table 2: Cross-lingual retrieval performance with and without name transliteration
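The language-model ranking with a simple mixture of document and corpus statistics, applied to overlapping character n-grams, can be sketched as follows in Python. The mixture weight `alpha`, the function names and the two toy documents are assumptions made for illustration; HAIRCUT's actual estimator and parameter settings are not given here.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def query_log_likelihood(query_terms, doc_terms, corpus_counts,
                         corpus_size, alpha=0.5):
    """log P(query | document) with a document/corpus mixture per term."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_corpus = corpus_counts[term] / corpus_size
        p = alpha * p_doc + (1 - alpha) * p_corpus
        if p == 0.0:
            return -math.inf  # term unseen anywhere in the collection
        score += math.log(p)
    return score


# Toy collection (assumption): two short "documents" indexed by bigrams.
docs = {"d1": "克林顿访问北京", "d2": "北京天气"}
indexed = {doc_id: char_ngrams(text) for doc_id, text in docs.items()}
corpus_counts = Counter(t for terms in indexed.values() for t in terms)
corpus_size = sum(corpus_counts.values())

query = char_ngrams("克林顿")
ranked = sorted(indexed,
                key=lambda d: query_log_likelihood(
                    query, indexed[d], corpus_counts, corpus_size),
                reverse=True)
```

The corpus term keeps a document from being ruled out by a single missing query term, which matters when the query contains transliterated names that the speech recognizer may have rendered differently.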
For the task of retrieving Mandarin audio from
Chinese text queries on the TDT-2 task, the system
described by Meng et al (2000) achieved a mean av-
erage precision of 0.733 using character bigrams for
indexing. On identical queries, HAIRCUT achieved
0.762 using character bigrams. This figure forms the
monolingual baseline for our CLIR system.
3.2 Cross-Lingual Retrieval Performance
We first indexed the automatic transcription of the
TDT-2 Mandarin audio collection using character
bigrams, as done by Meng et al (2000). We per-
formed CLIR using the Chinese translations of the
English queries, with and without transliteration of
proper names, and compared the standard 11-step
mean average precision (mAP) on the TDT-2 audio
corpus. Our results and the corresponding results
from Meng et al (2001) are reported in Table 2.
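For readers unfamiliar with the metric, the following Python sketch computes an 11-point interpolated average precision for one query and averages it over queries. Reading "11-step" as the usual 11-point interpolation at recall levels 0.0, 0.1, ..., 1.0 is our assumption, and the function names are hypothetical.

```python
def eleven_point_average_precision(ranked_relevance, num_relevant):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: booleans in rank order, True for an on-topic item.
    num_relevant: total number of on-topic items for the query.
    """
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        points.append((hits / num_relevant, hits / rank))
    total = 0.0
    for level in range(11):
        r = level / 10
        # Interpolated precision: best precision at any recall >= r.
        candidates = [p for rec, p in points if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


def mean_average_precision(per_query_scores):
    """Average the per-query scores over all queries."""
    return sum(per_query_scores) / len(per_query_scores)


ap = eleven_point_average_precision([True, False, True], num_relevant=2)
```

In this toy ranking, precision is 1.0 up to recall 0.5 and 2/3 beyond it, so the score is the average of six values of 1.0 and five values of 2/3.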
Without name transliteration, the performance of
the two CLIR systems is nearly identical: a paired
t-test shows that the difference in the mAPs of 0.514
and 0.501 is significant only at a p-value of 0.74.
A small improvement in mAP is obtained by the
HAIRCUT system with name transliteration over the
system without name transliteration: the improvement
from 0.501 to 0.515 is statistically significant
at a p-value of 0.084. The statistical significance of
the improvement from 0.514 to 0.522 by Meng et
al (2001) is not known to us. In any event, a need
for improvement in transliteration is suggested by
this result.
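The paired t-test used for these comparisons works on per-topic differences in average precision. A minimal Python sketch of the test statistic is given below; the per-topic scores are invented, and the function name is hypothetical. The p-value is then read from a Student t distribution with n-1 degrees of freedom, for instance with `scipy.stats.ttest_rel` if SciPy is available.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired per-topic scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Invented per-topic average precision for two systems on three topics.
with_names = [0.5, 0.7, 0.9]
without_names = [0.4, 0.5, 0.6]
t, dof = paired_t_statistic(with_names, without_names)
```

Because the test pairs scores topic by topic, a small but consistent gain can be significant even when the two means are close, and a larger but erratic gain may not be.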
We recently received a large list of nearly 2M
Chinese-English named-entity pairs from the LDC.
As a pilot experiment, we simply added this list
to the translation lexicon of the CLIR system, i.e.,
we “translated” those names in our English queries
which happened to be available in this LDC list.
This happens to cover more than 85% of the pre-
viously untranslatable names in our queries. For the
remaining names, we continued to use our automatic
transliterator. To our surprise, the mAP improvement
from 0.501 to 0.506 was statistically insignificant
(p-value of 0.421). The reason why the use
of the ostensibly correct transliteration most of the
time still does not result in any significant gain in
CLIR performance continues to elude us.
We conjecture that the fact that the audio has been
processed by an automatic speech recognition sys-
tem, which in all likelihood did not have many of
the proper names in question in its vocabulary, may
be the cause of this dismal performance. It is plausi-
ble, though we cannot find a stronger justification for
it, that by using the 10-best transliterations produced
by our automatic system, we are adding robustness
against ASR errors in the retrieval of proper names.
4 A Large Chinese-English Translation Table of Named Entities
The LDC Chinese-English named entity list was
compiled from Xinhua News sources, and consists
of nine pairs of lists, one each to cover person-
names, place-names, organizations, etc. While there
are indeed nearly 2 million name-pairs in this list, a
large number of formatting, character encoding and
other errors exist in this beta release, making it dif-
ficult to use the corpus as is in our statistical MT
system. We have tried using from this resource the
two lists corresponding to person-names and place-
names respectively, and have attempted to augment
the training data for our system described previously
in Section 2.1. However, we further screened these
lists as well in order to eliminate possible errors.
4.1 Extracting Named Entity Transliteration Pairs for Translation Model Training
There are nearly 1 million pairs of person or place-
names in the LDC corpus. In order to obtain a
clean corpus of Named Entity transliterations we
performed the following steps:
1. We converted all name-pairs into a parallel corpus
of English phonemes on one side and Chinese
GIFs on the other by the procedure described
earlier.
2. We trained a statistical MT system for trans-
lating from English phonemes to Chinese GIFs
from this corpus.
3. We then aligned all the (nearly 1M) training
“sentence” pairs with this translation model,
and extracted roughly a third of the sentences
with an alignment score above a certain tunable
threshold. This resulted in the extraction
of 346860 name-pairs.
4. We divided the set into 343738 pairs for train-
ing and 3122 for testing.
5. We estimated a pin-yin language model from
the training portion above.
6. We retrained the statistical MT system on this
presumably “good” training set and evaluated
the pin-yin error rate of the transliteration.
The result of this evaluation is reported in Table 3
against the line “Huge MT (Itself),” where we also report
the transliteration performance of the so-called
Big MT system of Table 1 on this new test set. We
note, again with some dismay, that the additional
training data did not result in a significant improvement
in transliteration performance.
MT System            Training   Test   Pin-yin
(Data filtered by)   Size       Size   Errors
Big MT               3625       3122   51.1%
Huge MT (Itself)     343738     3122   51.5%
Huge MT (Big MT)     309019     3122   42.5%
Table 3: Pin-yin error rates for MT systems with
varying amounts of training data and different data
selection procedures.
We continue to believe that careful data-selection
is the key to successful use of this beta-release of the
LDC Named Entity corpus. We therefore went back
to Step 3 of the procedure outlined above, where we
had used alignment scores from an MT system to
select “good” sentence-pairs from our training data,
and instead of using the MT system trained in Step
2 immediately preceding it, we used the previously
built Big MT system of Section 2.1, which we know
is trained on a small but clean data-set of 3625 name-
pairs. With a similar threshold as above, we again
selected roughly 300K name-pairs, being careful to
leave out any pair which appears in the 3122 pair
test set described above, and reestimated the entire
phoneme-to-GIF translation system on this new cor-
pus. We evaluated this system on the 3122 name-
pair test set for transliteration performance, and the
results are included in Table 3.
Note that significant improvements in translitera-
tion performance result from this alternate method
of data selection.
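Both data-selection procedures have the same shape: score every candidate name-pair with some alignment model, keep the pairs above a threshold, and leave out anything in the test set. The Python sketch below makes the scoring model a parameter so that either the model trained on the noisy data or the Big MT model can be plugged in. The scores, the threshold and the function name are hypothetical.

```python
def select_training_pairs(pairs, score_fn, threshold, held_out=()):
    """Keep pairs scoring above `threshold`, excluding held-out test pairs."""
    held_out = set(held_out)
    return [pair for pair in pairs
            if pair not in held_out and score_fn(pair) > threshold]


# Invented alignment log-scores for three (English, pin-yin) name-pairs.
TOY_SCORES = {
    ("Clinton", "ke lin dun"): -2.0,
    ("Boston", "bo shi dun"): -3.0,
    ("Smith", "zhong guo"): -20.0,  # a noisy, mismatched entry
}
candidates = list(TOY_SCORES)
test_set = [("Boston", "bo shi dun")]

selected = select_training_pairs(candidates, TOY_SCORES.get,
                                 threshold=-10.0, held_out=test_set)
```

The only difference between the two rows labeled Huge MT in Table 3 is which model plays the role of `score_fn`, which is why the choice of scorer rather than the amount of data accounts for the gain.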
4.2 Cross-Lingual Retrieval Performance — II
We reran the CLIR experiments on the TDT-2 cor-
pus using the somewhat improved entity translitera-
tor described above, with the same query and doc-
ument collection specifications as the experiments
reported in Table 2. The results of this second experiment
are reported in Table 4, where the performance
of the Big MT transliterator is reproduced for com-
parison.
Transliterator       mean Average Precision
(Data filtered by)   No NE    Automatic NE
Big MT               0.501    0.515
Huge MT (Big MT)     -        0.517
Table 4: Cross-lingual retrieval performance with
and without name transliteration
Note that the gain in CLIR performance is again
only marginally significant, with the improvement in
mAP from 0.501 to 0.517 being significant only at a
p-value of 0.080.
5 Concluding Remarks
We have presented a name transliteration procedure
based on statistical machine translation techniques
and have investigated its use in a cross-lingual spoken
document retrieval task. We have found small
gains in the extrinsic evaluation of our procedure:
mAP improvement from 0.501 to 0.517. In a more
intrinsic and direct evaluation, we have found ways
to gainfully filter a large but noisy training corpus
to augment the training data for our models and im-
prove transliteration accuracy considerably beyond
our starting point, e.g., to reduce pin-yin error rates
from 51.1% to 42.5%. We expect to further refine
the translation models in the future and apply them
in other tasks such as text translation.

References
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and Robert L. Mercer. 1993. The mathematics
of statistical machine translation: Parameter estima-
tion. Computational Linguistics, 19(2):263-311.
Sung Young Jung, SungLim Hong, and Eunok Paek.
2000. An English to Korean Transliteration Model of
Extended Markov Window. Proceedings of COLING.
K. Knight and J. Graehl. 1997. Machine Transliteration.
Proceedings of ACL.
Paul McNamee and Jim Mayfield. 2001. JHU/APL Ex-
periments at CLEF-2001: Translation Resources and
Score Normalization. Proceedings of CLEF.
Paul McNamee and Jim Mayfield. 2002. Comparing
Cross-Language Query Expansion Techniques by De-
grading Translation Resources. Proceedings of SIGIR.
Helen M. Meng et al. 2000. Mandarin-English Information
(MEI): Investigating Translingual Speech Retrieval.
Technical Report for the Johns Hopkins Univ.
Summer Workshop.
Helen M. Meng, Wai-Kit Lo, Berlin Chen, and Karen
Tang. 2001. Generating Phonetic Cognates to Handle
Named Entities in English-Chinese Cross-Language
Spoken Document Retrieval. Proceedings of ASRU.
Practical Chinese Reader, Book I. The Commercial Press
LTD. 1981.
Bonnie Glover Stalls and Kevin Knight. 1998. Translat-
ing Names and Technical Terms in Arabic Text. Pro-
ceedings of the COLING/ACL Workshop on Computa-
tional Approaches to Semitic Languages.
Djoerd Hiemstra. 2001. Using Language Models
for Information Retrieval. Ph.D. thesis, University of
Twente, Netherlands.
