<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2025">
  <Title>Sydney, July 2006. © 2006 Association for Computational Linguistics. A Modified Joint Source-Channel Model for Transliteration. Asif Ekbal</Title>
  <Section position="2" start_page="0" end_page="192" type="abstr">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In Natural Language Processing (NLP) application areas such as information retrieval, question answering and machine translation, there is an increasing need to translate out-of-vocabulary (OOV) words from one language to another. Such words are handled through transliteration, the method of rendering the original foreign words in the characters of the target language while preserving their pronunciation in the original language. Thus, the central problem in transliteration is predicting the pronunciation of the original word. Transliteration between two languages that use the same alphabet is trivial: the word is left as it is. However, for languages that use different alphabets, the names must be transliterated or rendered in the target language alphabet.</Paragraph>
    <Paragraph position="1"> Technical terms and named entities make up the bulk of these OOV words. Named entities hold an important place in NLP applications. Proper identification, classification and translation of named entities are crucial in many NLP applications and pose a major challenge to NLP researchers. Named entities are usually not found in bilingual dictionaries, and they are very productive in nature. Translation of named entities is a tricky task: it involves both translation and transliteration. Transliteration is commonly used for named entities, even when the words could be translated. Different types of named entities are translated differently.</Paragraph>
    <Paragraph position="2"> Numerical and temporal expressions typically use a limited vocabulary (e.g., names of months, days of the week, etc.) and can be translated fairly easily using simple translation patterns. The named entity machine transliteration algorithms presented in this work focus on person names, locations and organizations. A machine transliteration system that is trained on person names is very important in a multilingual country like India, where large name collections such as census data, electoral rolls and railway reservation information must be available to multilingual citizens of the country in their vernacular. In the present work, the various proposed models have been evaluated on a training corpus of person names.</Paragraph>
    <Paragraph position="3"> A hybrid neural network and knowledge-based system to generate multiple English spellings for Arabic personal names is described in (Arbabi et al., 1994). (Knight and Graehl, 1998) developed a phoneme-based statistical model using finite-state transducers that implement transformation rules to do back-transliteration. (Stalls and Knight, 1998) adapted this approach for back-transliteration from Arabic to English for English names. A spelling-based model is described in (Al-Onaizan and Knight, 2002a; Al-Onaizan and Knight, 2002c) that directly maps English letter sequences into Arabic letter sequences with associated probabilities that are trained on a small English/Arabic name list without the need for English pronunciations. The phonetics-based and spelling-based models have been linearly combined into a single transliteration model in (Al-Onaizan and Knight, 2002b) for transliteration of Arabic named entities into English.</Paragraph>
    <Paragraph position="4"> Several phoneme-based techniques have been proposed in the recent past for machine transliteration using transformation-based learning algorithms (Meng et al., 2001; Jung et al., 2000; Vigra and Khudanpur, 2003).</Paragraph>
    <Paragraph position="5"> (Abduljaleel and Larkey, 2003) have presented a simple statistical technique to train an English-Arabic transliteration model from pairs of names. The first stage of the two-stage training procedure learns which n-gram segments should be added to the unigram inventory for the source language, and the second stage learns the translation model over this inventory. This technique requires no heuristic or linguistic knowledge of either language.</Paragraph>
    <Paragraph position="6"> (Goto et al., 2003) described an English-Japanese transliteration method in which an English word is divided into conversion units, which are partial character strings of the English word, and each English conversion unit is converted into a partial Japanese Katakana character string. The method calculates the likelihood of a particular chunking of an English word into conversion units by linking them to Katakana characters using syllables. Thus the English conversion units take phonetic aspects into account. The method considers the English and Japanese contextual information simultaneously to calculate the plausibility of conversion from each English conversion unit to various Japanese conversion units, using a single probability model based on the maximum entropy method.</Paragraph>
    <Paragraph position="7"> (Haizhou et al., 2004) presented a framework that allows direct orthographical mapping between English and Chinese through a joint source-channel model, called the n-gram transliteration model. The orthographic alignment process is automated using the maximum likelihood approach, through the Expectation Maximization algorithm, to derive aligned transliteration units from a bilingual dictionary. The joint source-channel model tries to capture how source and target names can be generated simultaneously, i.e., the context information on both the source and the target sides is taken into account.</Paragraph>
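As a rough illustration of the joint source-channel idea described above, the following minimal sketch scores a name pair as a product of conditional probabilities over aligned transliteration-unit pairs. It is a simplification under stated assumptions, not the implementation of (Haizhou et al., 2004): it uses bigrams only, unsmoothed maximum-likelihood estimates, a hypothetical boundary marker, and toy training data that is assumed to be already aligned (no EM alignment step is shown).

```python
from collections import Counter
import math

# Toy aligned training data (hypothetical): each name is a sequence of
# (source_unit, target_unit) pairs, as an alignment step would produce.
aligned_names = [
    [("ra", "RA"), ("ma", "MA")],
    [("ra", "RA"), ("ja", "JA")],
    [("ma", "MA"), ("ra", "RA")],
]

START = ("^", "^")  # boundary marker, an assumption of this sketch


def train_bigram(names):
    """Count pair bigrams and their contexts (maximum-likelihood estimate)."""
    bigrams, contexts = Counter(), Counter()
    for pairs in names:
        prev = START
        for pair in pairs:
            bigrams[(prev, pair)] += 1
            contexts[prev] += 1
            prev = pair
    return bigrams, contexts


def joint_log_prob(pairs, bigrams, contexts):
    """Return log P(S, T) as a sum of log P(pair_k | pair_{k-1}).

    Returns -inf when a bigram was never seen, since no smoothing is applied.
    """
    total, prev = 0.0, START
    for pair in pairs:
        count = bigrams[(prev, pair)]
        if count == 0:
            return float("-inf")
        total += math.log(count / contexts[prev])
        prev = pair
    return total
```

Because each modelling unit is a source-target pair, the conditioning context carries information from both sides at once, which is the property the paragraph above attributes to the joint model. A practical system would add smoothing or back-off for unseen pairs.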
    <Paragraph position="8"> A tuple n-gram transliteration model (Marino et al., 2005; Crego et al., 2005) has been log-linearly combined with feature functions to develop a statistical machine translation system for Spanish-to-English and English-to-Spanish translation tasks. The model approximates the joint probability between source and target languages by using trigrams.</Paragraph>
    <Paragraph position="9"> The present work differs from (Goto et al., 2003; Haizhou et al., 2004) in that the transliteration units in the source language are identified using regular expressions, and no probabilistic model is used for this step.</Paragraph>
    <Paragraph position="10"> The proposed modified joint source-channel model is similar to the model proposed by (Goto et al., 2003), but it differs in the way the transliteration units and the contextual information are defined. No linguistic knowledge is used in (Goto et al., 2003; Haizhou et al., 2004), whereas the present work uses linguistic knowledge in the form of possible conjuncts and diphthongs in Bengali.</Paragraph>
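To make the idea of regular-expression-based unit identification concrete, here is a minimal sketch for Latin-script names. The pattern is purely hypothetical (a consonant run followed by a vowel run, plus any trailing consonant run); this section does not state the actual expressions used, and the Bengali side, which relies on conjuncts and diphthongs, is not modelled here.

```python
import re

# Hypothetical pattern: a run of consonants followed by a run of vowels,
# or a trailing consonant run. This is an assumption for illustration only.
TU_PATTERN = re.compile(r"[^aeiou]*[aeiou]+|[^aeiou]+")


def split_units(word):
    """Split a Latin-script name into candidate transliteration units."""
    return TU_PATTERN.findall(word.lower())
```

For example, `split_units("manisha")` returns `['ma', 'ni', 'sha']`. The segmentation is deterministic, so no probabilistic chunking model is required for this step.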
    <Paragraph position="11"> The paper is organized as follows. The machine transliteration problem is formulated under both the noisy-channel model and the joint source-channel model in Section 2. A number of transliteration models based on collocation statistics, including the modified joint source-channel model, and their evaluation scheme are proposed in Section 3. The Bengali-English machine transliteration scenario is presented in Section 4. The proposed models are evaluated and the results of the evaluation are reported in Section 5. Conclusions are drawn in Section 6.</Paragraph>
  </Section>
</Paper>