File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/p04-1021_metho.xml
Size: 25,186 bytes
Last Modified: 2025-10-06 14:08:58
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1021"> <Title>A Joint Source-Channel Model for Machine Transliteration</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Problems in transliteration </SectionTitle> <Paragraph position="0"> Transliteration is a process that takes a character string in source language as input and generates a character string in the target language as output.</Paragraph> <Paragraph position="1"> The process can be seen conceptually as two levels of decoding: segmentation of the source string into transliteration units; and relating the source language transliteration units with units in the target language, by resolving different combinations of alignments and unit mappings. A unit could be a Chinese character or a monograph, a digraph or a trigraph and so on for English.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Phoneme-based approach </SectionTitle> <Paragraph position="0"> The problems of English-Chinese transliteration have been studied extensively in the paradigm of noisy channel model (NCM). For a given English name E as the observed channel output, one seeks a posteriori the most likely Chinese transliteration C that maximizes P(C|E). Applying Bayes rule, it means to find C to maximize</Paragraph> <Paragraph position="2"> with equivalent effect. To do so, we are left with modeling two probability distributions: P(E|C), the probability of transliterating C to E through a noisy channel, which is also called transformation rules, and P(C), the probability distribution of source, which reflects what is considered good Chinese transliteration in general. Likewise, in C2E backtransliteration, we would find E that maximizes</Paragraph> <Paragraph position="4"> for a given Chinese name.</Paragraph> <Paragraph position="5"> In eqn (1) and (2), P(C) and P(E) are usually estimated using n-gram language models (Jelinek, 1991). Inspired by research results of grapheme-to-phoneme research in speech synthesis literature, many have suggested phoneme-based approaches to resolving P(E|C) and P(C|E), which approximates the probability distribution by introducing a phonemic representation. In this way, we convert the names in the source language, say E, into an intermediate phonemic representation P, and then convert the phonemic representation into the target language, say Chinese C. In E2C transliteration, the phoneme-based approach can be formulated as P(C|E) = P(C|P)P(P|E) and conversely we have P(E|C) = P(E|P)P(P|C) for C2E back-transliteration.</Paragraph> <Paragraph position="6"> Several phoneme-based techniques have been proposed in the recent past for machine transliteration using transformation-based learning algorithm (Meng et al. 2001; Jung et al, 2000; Virga & Khudanpur, 2003) and using finite state transducer that implements transformation rules (Knight & Graehl, 1998), where both handcrafted and data-driven transformation rules have been studied.</Paragraph> <Paragraph position="7"> However, the phoneme-based approaches are limited by two major constraints, which could compromise transliterating precision, especially in English-Chinese transliteration: 1) Latin-alphabet foreign names are of different origins. For instance, French has different phonic rules from those of English. The phoneme-based approach requires derivation of proper phonemic representation for names of different origins. One may need to prepare multiple language-dependent grapheme-to-phoneme (G2P) conversion systems accordingly, and that is not easy to achieve (The Onomastica Consortium, 1995). For example, /Lafontant/ is transliterated into La Feng Tang (La-Feng-Tang) while /Constant/ becomes Kang Si Tan Te (Kang-Si-Tan-Te) , where syllable /-tant/ in the two names are transliterated differently depending on the names' language of origin.</Paragraph> <Paragraph position="8"> 2) Suppose that language dependent grapheme-to-phoneme systems are attainable, obtaining Chinese orthography will need two further steps: a) conversion from generic phonemic representation to Chinese Pinyin; b) conversion from Pinyin to Chinese characters. Each step introduces a level of imprecision. Virga and Khudanpur (2003) reported 8.3% absolute accuracy drops when converting from Pinyin to Chinese characters, due to homophone confusion. Unlike Japanese katakana or Korean alphabet, Chinese characters are more ideographic than phonetic. To arrive at an appropriate Chinese transliteration, one cannot rely solely on the intermediate phonemic representation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Useful orthographic context </SectionTitle> <Paragraph position="0"> To illustrate the importance of contextual information in transliteration, let's take name /Minahan/ as an example, the correct segmentation should be /Mi-na-han/, to be transliterated as Mi - null According to the transliteration guidelines, a wise segmentation can be reached only after exploring the combination of the left and right context of transliteration units. From the computational point of view, this strongly suggests using a contextual n-gram as the knowledge base for the alignment decision.</Paragraph> <Paragraph position="1"> Another example will show us how one-to-many mappings could be resolved by context. Let's take another name /Smith/ as an example. Although we can arrive at an obvious segmentation /s-mi-th/, there are three Chinese characters for each of /s-/, /-mi-/ and /-th/. Furthermore, /s-/ and /-th/ correspond to overlapping characters as well, as shown next.</Paragraph> <Paragraph position="2"> A human translator will use transliteration rules between English syllable sequence and Chinese character sequence to obtain the best mapping Shi Mi -Si , as indicated in italic in the table above. To address the issues in transliteration, we propose a direct orthographic mapping (DOM) framework through a joint source-channel model by fully exploring orthographic contextual information, aiming at alleviating the imprecision introduced by the multiple-step phoneme-based approach.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="111" type="metho"> <SectionTitle> 3 Joint source-channel model </SectionTitle> <Paragraph position="0"> In view of the close coupling of the source and target transliteration units, we propose to estimate P(E,C) by a joint source-channel model, or n-gram transliteration model (TM). For K aligned transliteration units, we have</Paragraph> <Paragraph position="2"> which provides an alternative to the phoneme-based approach for resolving eqn. (1) and (2) by eliminating the intermediate phonemic representation.</Paragraph> <Paragraph position="3"> Unlike the noisy-channel model, the joint source-channel model does not try to capture how source names can be mapped to target names, but rather how source and target names can be generated simultaneously. In other words, we estimate a joint probability model that can be easily marginalized in order to yield conditional probability models for both transliteration and back-transliteration.</Paragraph> <Paragraph position="4"> Suppose that we have an English name</Paragraph> <Paragraph position="6"> =a and a Chinese transliteration</Paragraph> <Paragraph position="8"> Chinese characters. Oftentimes, the number of letters is different from the number of Chinese characters. A Chinese character may correspond to a letter substring in English or vice versa.</Paragraph> <Paragraph position="10"> where there exists an alignment g with >=<><</Paragraph> <Paragraph position="12"> correspondence >< ce, is called a transliteration pair. Then, the E2C transliteration can be formulated as</Paragraph> <Paragraph position="14"> An n-gram transliteration model is defined as the conditional probability, or transliteration probability, of a transliteration pair</Paragraph> <Paragraph position="16"> depending on its immediate n predecessor pairs:</Paragraph> <Section position="1" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 3.1 Transliteration alignment </SectionTitle> <Paragraph position="0"> A bilingual dictionary contains entries mapping English names to their respective Chinese transliterations. Like many other solutions in computational linguistics, it is possible to automatically analyze the bilingual dictionary to acquire knowledge in order to map new English names to Chinese and vice versa. Based on the transliteration formulation above, a transliteration model can be built with transliteration unit's n-gram statistics. To obtain the statistics, the bilingual dictionary needs to be aligned. The maximum likelihood approach, through EM algorithm (Dempster, 1977), allows us to infer such an alignment easily as described in the table below.</Paragraph> <Paragraph position="1"> The aligning process is different from that of transliteration given in eqn. (4) or (5) in that, here we have fixed bilingual entries, a and b . The aligning process is just to find the alignment segmentation g between the two strings that maximizes the joint probability:</Paragraph> <Paragraph position="3"> A set of transliteration pairs that is derived from the aligning process forms a transliteration table, which is in turn used in the transliteration decoding. As the decoder is bounded by this table, it is important to make sure that the training database covers as much as possible the potential transliteration patterns. Here are some examples of resulting alignment pairs.</Paragraph> <Paragraph position="4"> Knowing that the training data set will never be sufficient for every n-gram unit, different smoothing approaches are applied, for example, by using backoff or class-based models, which can be found in statistical language modeling literatures (Jelinek, 1991).</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 3.2 DOM: n-gram TM vs. NCM </SectionTitle> <Paragraph position="0"> Although in the literature, most noisy channel models (NCM) are studied under phoneme-based paradigm for machine transliteration, NCM can also be realized under direct orthographic mapping (DOM). Next, let's look into a bigram case to see what n-gram TM and NCM present to us. For E2C conversion, re-writing eqn (1) and eqn (6) , we</Paragraph> <Paragraph position="2"> The formulation of eqn. (8) could be interpreted as a hidden Markov model with Chinese characters as its hidden states and English transliteration units as the observations (Rabiner, 1989). The number of parameters in the bigram TM is potentially</Paragraph> <Paragraph position="4"> while in the noisy channel model (NCM) it's CT + , where T is the number of transliteration pairs and C is the number of Chinese transliteration units. In eqn. (9), the current transliteration depends on both Chinese and English transliteration history while in eqn. (8), it depends only on the previous Chinese unit.</Paragraph> <Paragraph position="5"> As CTT +>> , an n-gram TM gives a finer description than that of NCM. The actual size of models largely depends on the availability of training data. In Table 1, one can get an idea of how they unfold in a real scenario. With adequately sufficient training data, n-gram TM is expected to outperform NCM in the decoding. A perplexity study in section 4.1 will look at the model from another perspective.</Paragraph> </Section> </Section> <Section position="5" start_page="111" end_page="111" type="metho"> <SectionTitle> 4 The experiments </SectionTitle> <Paragraph position="0"> We use a database from the bilingual dictionary &quot;Chinese Transliteration of Foreign Personal Names&quot; which was edited by Xinhua News Agency and was considered the de facto standard of personal name transliteration in today's Chinese press. The database includes a collection of 37,694 unique English entries and their official Chinese transliteration. The listing includes personal names of English, French, Spanish, German, Arabic, Russian and many other origins.</Paragraph> <Paragraph position="1"> The database is initially randomly distributed into 13 subsets. In the open test, one subset is withheld for testing while the remaining 12 subsets are used as the training materials. This process is repeated 13 times to yield an average result, which is called the 13-fold open test. After experiments, we found that each of the 13-fold open tests gave consistent error rates with less than 1% deviation.</Paragraph> <Paragraph position="2"> Therefore, for simplicity, we randomly select one of the 13 subsets, which consists of 2896 entries, as the standard open test set to report results. In the close test, all data entries are used for training and testing.</Paragraph> <Paragraph position="3"> 1. Bootstrap initial random alignment 2. Expectation: Update n-gram statistics to estimate probability distribution 3. Maximization: Apply the n-gram TM to obtain new alignment 4. Go to step 2 until the alignment converges 5. Derive a list transliteration units from final alignment as transliteration table</Paragraph> <Section position="1" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 4.1 Modeling </SectionTitle> <Paragraph position="0"> The alignment of transliteration units is done fully automatically along with the n-gram TM training process. To model the boundary effects, we introduce two extra units <s> and </s> for start and end of each name in both languages. The EM iteration converges at 8 th round when no further alignment changes are reported. Next are some statistics as a result of the model training: The most common metric for evaluating an n-gram model is the probability that the model assigns to test data, or perplexity (Jelinek, 1991). For a test set W composed of V names, where each name has been aligned into a sequence of transliteration pair tokens, we can calculate the of a model is the reciprocal of the average probability assigned by the model to each aligned pair in the test set W as</Paragraph> <Paragraph position="2"> Clearly, lower perplexity means that the model describes better the data. It is easy to understand that closed test always gives lower perplexity than open test.</Paragraph> <Paragraph position="3"> We have the perplexity reported in Table 2 on the aligned bilingual dictionary, a database of 119,364 aligned tokens. The NCM perplexity is computed using n-gram equivalents of eqn. (8) for E2C transliteration, while TM perplexity is based on those of eqn (9) which applies to both E2C and C2E. It is shown that TM consistently gives lower perplexity than NCM in open and closed tests. We have good reason to expect TM to provide better transliteration results which we expect to be confirmed later in the experiments.</Paragraph> <Paragraph position="4"> The Viterbi algorithm produces the best sequence by maximizing the overall probability, ),,( gbaP . In CLIR or multilingual corpus alignment (Virga and Khudanpur, 2003), N-best results will be very helpful to increase chances of correct hits. In this paper, we adopted an N-best stack decoder (Schwartz and Chow, 1990) in both TM and NCM experiments to search for N-best results. The algorithm also allows us to apply higher order n-gram such as trigram in the search.</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 4.2 E2C transliteration </SectionTitle> <Paragraph position="0"> In this experiment, we conduct both open and closed tests for TM and NCM models under DOM paradigm. Results are reported in Table 3 and In word error report, a word is considered correct only if an exact match happens between transliteration and the reference. The character error rate is the sum of deletion, insertion and substitution errors. Only the top choice in N-best results is used for error rate reporting. Not surprisingly, one can see that n-gram TM, which benefits from the joint source-channel model coupling both source and target contextual information into the model, is superior to NCM in all the test cases.</Paragraph> </Section> <Section position="3" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 4.3 C2E back-transliteration </SectionTitle> <Paragraph position="0"> The C2E back-transliteration is more challenging than E2C transliteration. Not many studies have been reported in this area. It is common that multiple English names are mapped into the same Chinese transliteration. In Table 1, we see only 28,632 unique Chinese transliterations exist for 37,694 English entries, meaning that some phonemic evidence is lost in the process of transliteration. To better understand the task, let's compare the complexity of the two languages presented in the bilingual dictionary.</Paragraph> <Paragraph position="1"> Table 1 also shows that the 5,640 transliteration pairs are cross mappings between 3,683 English and 374 Chinese units. In order words, on average, for each English unit, we have 1.53 = 5,640/3,683 Chinese correspondences. In contrast, for each Chinese unit, we have 15.1 = 5,640/374 English back-transliteration units! Confusion is increased tenfold going backward.</Paragraph> <Paragraph position="2"> The difficulty of back-transliteration is also reflected by the perplexity of the languages as in Table 5. Based on the same alignment tokenization, we estimate the monolingual language perplexity for Chinese and English independently using the n-gram language models surprise, Chinese names have much lower perplexity than English names thanks to fewer Chinese units. This contributes to the success of E2C but presents a great challenge to C2E backtransliteration. null</Paragraph> </Section> <Section position="4" start_page="111" end_page="111" type="sub_section"> <SectionTitle> tests </SectionTitle> <Paragraph position="0"> A back-transliteration is considered correct if it falls within the multiple valid orthographically correct options. Experiment results are reported in Table 6. As expected, C2E error rate is much higher than that of E2C.</Paragraph> <Paragraph position="1"> In this paper, the n-gram TM model serves as the sole knowledge source for transliteration.</Paragraph> <Paragraph position="2"> However, if secondary knowledge, such as a lookup table of valid target transliterations, is available, it can help reduce error rate by discarding invalid transliterations top-down the N choices. In Table 7, the word error rates for both E2C and C2E are reported which imply potential error reduction by secondary knowledge source.</Paragraph> <Paragraph position="3"> The N-best error rates are reduced significantly at 10-best level as reported in Table 7.</Paragraph> </Section> </Section> <Section position="6" start_page="111" end_page="111" type="metho"> <SectionTitle> 5 Discussions </SectionTitle> <Paragraph position="0"> It would be interesting to relate n-gram TM to other related framework.</Paragraph> <Section position="1" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 5.1 DOM: n-gram TM vs. ID3 </SectionTitle> <Paragraph position="0"> In section 4, one observes that contextual information in both source and target languages is essential. To capture them in the modeling, one could think of decision tree, another popular machine learning approach. Under the DOM framework, here is the first attempt to apply decision tree in E2C and C2E transliteration.</Paragraph> <Paragraph position="1"> With the decision tree, given a fixed size learning vector, we used top-down induction trees to predict the corresponding output. Here we implement ID3 (Quinlan, 1993) algorithm to construct the decision tree which contains questions and return values at terminal nodes.</Paragraph> <Paragraph position="2"> Similar to n-gram TM, for unseen names in open test, ID3 has backoff smoothing, which lies on the default case which returns the most probable value as its best guess for a partial tree path according to the learning set.</Paragraph> <Paragraph position="3"> In the case of E2C transliteration, we form a learning vector of 6 attributes by combining 2 left and 2 right letters around the letter of focus illustrated in Table 8, where both English and Chinese contexts are used to infer a Chinese character. Similarly, 4 attributes combining 1 left, 1 centre and 1 right Chinese character and 1 previous English unit are used for the learning vector in C2E test. An aligned bilingual dictionary is needed to build the decision tree.</Paragraph> <Paragraph position="4"> To minimize the effects from alignment variation, we use the same alignment results from section 4. Two trees are built for two directions, E2C and C2E. The results are compared with those</Paragraph> <Paragraph position="6"> One observes that n-gram TM consistently outperforms ID3 decision tree in all tests. Three factors could have contributed: 1) English transliteration unit size ranges from 1 letter to 7 letters. The fixed size windows in ID3 obviously find difficult to capture the dynamics of various ranges. n-gram TM seems to have better captured the dynamics of transliteration units; 2) The backoff smoothing of n-gram TM is more effective than that of ID3; 3) Unlike n-gram TM, ID3 requires a separate aligning process for bilingual dictionary. The resulting alignment may not be optimal for tree construction. Nevertheless, ID3 presents another successful implementation of DOM framework.</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 5.2 DOM vs. phoneme-based approach </SectionTitle> <Paragraph position="0"> Due to lack of standard data sets, it is difficult to compare the performance of the n-gram TM to that of other approaches. For reference purpose, we list some reported studies on other databases of E2C transliteration tasks in Table 10. As in the references, only character and Pinyin error rates are reported, we only include our character and Pinyin error rates for easy reference. The reference data are extracted from Table 1 and 3 of (Virga and Khudanpur 2003). As we have not found any C2E result in the literature, only E2C results are compared here.</Paragraph> <Paragraph position="1"> The first 4 setups by Virga et al all adopted the phoneme-based approach in the following steps: 1) English name to English phonemes; 2) English phonemes to Chinese Pinyin; 3) Chinese Pinyin to Chinese characters.</Paragraph> <Paragraph position="2"> It is obvious that the n-gram TM compares favorably to other techniques. n-gram TM presents an error reduction of 74.6%=(42.5-10.8)/42.5% for Pinyin over the best reported result, Huge MT (Big MT) test case, which is noteworthy.</Paragraph> <Paragraph position="3"> The DOM framework shows a quantum leap in performance with n-gram TM being the most successful implementation. The n-gram TM and ID3 under direct orthographic mapping (DOM) paradigm simplify the process and reduce the chances of conversion errors. As a result, n-gram TM and ID3 do not generate Chinese Pinyin as intermediate results. It is noted that in the 374 legitimate Chinese characters for transliteration, character to Pinyin mapping is unique while Pinyin to character mapping could be one to many. Since we have obtained results in character already, we expect less Pinyin error than character error should a character-to-Pinyin mapping be needed.</Paragraph> </Section> </Section> <Section position="7" start_page="111" end_page="111" type="metho"> <SectionTitle> studies 6 Conclusions </SectionTitle> <Paragraph position="0"> In this paper, we propose a new framework (DOM) for transliteration. n-gram TM is a successful realization of DOM paradigm. It generates probabilistic orthographic transformation rules using a data driven approach. By skipping the intermediate phonemic interpretation, the transliteration error rate is reduced significantly. Furthermore, the bilingual aligning process is integrated into the decoding process in n-gram TM, which allows us to achieve a joint optimization of alignment and transliteration automatically. Unlike other related work where pre-alignment is needed, the new framework greatly reduces the development efforts of machine transliteration systems. Although the framework is implemented on an English-Chinese personal name data set, without loss of generality, it well applies to transliteration of other language pairs such as English/Korean and English/Japanese.</Paragraph> <Paragraph position="1"> It is noted that place and company names are sometimes translated in combination of transliteration and meanings, for example, /Victoria-Fall/ becomes Wei Duo Li Ya Pu Bu (Pinyin:Wei Duo Li Ya Pu Bu). As the proposed framework allows direct orthographical mapping, it can also be easily extended to handle such name translation. We expect to see the proposed model to be further explored in other related areas.</Paragraph> </Section> class="xml-element"></Paper>