File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-1508_intro.xml
Size: 4,153 bytes
Last Modified: 2025-10-06 14:02:00
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1508"> <Title>Transliteration of Proper Names in Cross-Lingual Information Retrieval</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Translation of proper names is generally recognized as a significant problem in many multi-lingual text and speech processing applications. Even when hand-crafted translation lexicons used for machine translation (MT) and cross-lingual information retrieval (CLIR) provide significant coverage of the words encountered in the text, a significant portion of the tokens not covered by the lexicon are proper names and domain-specific terminology (cf., e.g., Meng et al. (2000)). This lack of translations adversely affects performance. For CLIR applications in particular, proper names and technical terms are especially important, as they carry the most distinctive information in a query, as corroborated by their relatively low document frequency. Finally, in interactive IR systems where users provide very short queries (e.g., 2-5 words), their importance grows even further.</Paragraph> <Paragraph position="1"> Unlike specialized terminology, however, proper names are amenable to a speech-inspired translation approach. When writing foreign names in one's own language, one tries to preserve the way they sound.</Paragraph> <Paragraph position="2"> That is, one uses an orthographic representation which, when &quot;read aloud&quot; by a speaker of one's language, sounds as much as possible like the name would when spoken by a speaker of the foreign language -- a process referred to as transliteration. Therefore, if a mechanism were available to render, say, an English name in its phonemic form, and another mechanism were available to convert this phonemic string into the orthography of, say, Chinese, then one would have a mechanism for transliterating English names using Chinese characters. 
The first step has been addressed extensively, for other obvious reasons, in the automatic speech synthesis literature. This paper describes a statistical approach for the second step.</Paragraph> <Paragraph position="3"> Several techniques have been proposed in the recent past for name transliteration. Rather than providing a comprehensive survey, we highlight a few representative approaches here. Finite state transducers that implement transformation rules for back-transliteration from Japanese to English have been described by Knight and Graehl (1997), and extended to Arabic by Glover-Stalls and Knight (1998). In both cases, the goal is to recognize words in Japanese or Arabic text which happen to be transliterations of English names. If the orthography of a language is strongly phonetic, as is the case for Korean, then one may use relatively simple hidden Markov models to transform English pronunciations, as shown by Jung et al. (2000). The work closest to our application scenario, and the one with which we will be making several direct comparisons, is that of Meng et al. (2001). In their work, a set of hand-crafted transformations for locally editing the phonemic spelling of an English word to conform to the rules of Mandarin syllabification is used to seed a transformation-based learning algorithm.</Paragraph> <Paragraph position="4"> The algorithm examines some data and learns the proper sequence of application of the transformations to convert an English phoneme sequence to a Mandarin syllable sequence. Our paper describes a data-driven counterpart to this technique, in which a cascade of two source-channel translation models is used to go from English names to their Chinese transliteration. 
Thus even the initial requirement of creating candidate transformation rules, which may require knowledge of the phonology of the target language, is eliminated.</Paragraph> <Paragraph position="5"> We also investigate incorporation of this transliteration system in a cross-lingual spoken document retrieval application, in which English text queries are used to index and retrieve Mandarin audio from the TDT corpus.</Paragraph> </Section> </Paper>