<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0317"> <Title>Acquisition of English-Chinese Transliterated Word Pairs from Parallel-Aligned Texts using a Statistical Machine Transliteration Model</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Automatic bilingual lexicon construction based on bilingual corpora has become an important first step for many studies and applications of natural language processing (NLP), such as machine translation (MT), cross-language information retrieval (CLIR), and bilingual text alignment. As noted in Tsuji (2002), many previous methods (Dagan et al., 1993; Kupiec, 1993; Wu and Xia, 1994; Melamed, 1996; Smadja et al., 1996) deal with this problem based on the frequency of words appearing in the corpora, and therefore cannot be applied effectively to low-frequency words, such as transliterated words. Transliterated words are often domain-specific, and new ones are created frequently. Many of them are not found in existing bilingual dictionaries. Thus, it is difficult to handle transliteration via simple dictionary lookup alone. For CLIR, the accuracy of transliteration strongly affects retrieval performance.</Paragraph> <Paragraph position="1"> In this paper, we present a framework for acquiring English-Chinese transliterated word pairs based on a proposed statistical machine transliteration model.</Paragraph> <Paragraph position="2"> Recently, much research has been done on machine transliteration for many language pairs, such as English/Arabic (Al-Onaizan and Knight, 2002), English/Chinese (Chen et al., 1998; Lin and Chen, 2002; Wan and Verspoor, 1998), English/Japanese (Knight and Graehl, 1998), and English/Korean (Lee and Choi, 1997; Oh and Choi, 2002). 
Most previous approaches to machine transliteration have focused on the use of a pronunciation dictionary for converting source words into phonetic symbols, a manually assigned scoring matrix for measuring phonetic similarities between source and target words, or a method based on heuristic rules for source-to-target word transliteration. However, words with unknown pronunciations may cause problems for transliteration. In addition, using either a language-dependent penalty function to measure the similarity between bilingual word pairs or handcrafted heuristic mapping rules for transliteration may lead to problems when porting to other language pairs.</Paragraph> <Paragraph position="3"> The method proposed in this paper requires no conversion of source words into phonetic symbols. The model is trained automatically on a bilingual proper name list via unsupervised learning.</Paragraph> <Paragraph position="4"> The remainder of the paper is organized as follows: Section 2 gives an overview of machine transliteration and describes the proposed model. Section 3 describes how to apply the model to extract transliterated target words from parallel texts. The experimental setup and a quantitative assessment of performance are presented in Section 4. Concluding remarks are made in Section 5.</Paragraph> </Section> </Paper>