<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1672">
<Title>Discriminative Methods for Transliteration</Title>
<Section position="4" start_page="0" end_page="612" type="metho">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Name transliteration is the task of transcribing a name from one alphabet to another. For example, an Arabic "mw", a Korean "wilrieom", and a Russian "Vil'iam" all correspond to the English "William". We address the problem of transliteration in the general setting: it involves both recovering original English names from their transcriptions in a foreign language and finding an acceptable English spelling of a foreign name.</Paragraph>
<Paragraph position="1"> We apply name transliteration in the context of cross-lingual information extraction. Name extractors are currently available in multiple languages. Our goal is to make the extracted names understandable to monolingual English speakers by transliterating the names into English.</Paragraph>
<Paragraph position="2"> The extraction context of the transliteration application imposes additional complexity constraints on the task. In particular, we aim for the transliteration speed to be comparable to the extraction speed. Since most current extraction systems are fairly fast (more than 1 GB of text per hour), the complexity requirement narrows the range of techniques applicable to transliteration. More precisely, we cannot use the Web and web-count information to home in on the right transliteration candidate. Instead, all relevant transliteration information has to be represented within a compact and self-contained transliteration model.</Paragraph>
<Paragraph position="3"> We present two methods for creating and applying transliteration models. In contrast to most previous transliteration approaches, our models are discriminative. Using an existing transliteration dictionary D (a set of name pairs {(f,e)}), we learn a function that directly maps a name f in one language into a name e in another language.</Paragraph>
<Paragraph position="4"> We estimate neither the direct conditional p(e|f), nor the reverse conditional p(f|e), nor the joint p(e,f) probability model. Furthermore, we do away with the notion of alignment: our transliteration model does not require, and is not defined in terms of, aligned e and f. Instead, all features used by the model are computed directly from the names f and e without any need for their alignment.</Paragraph>
<Paragraph position="5"> The two discriminative methods that we present correspond to the local and global modeling paradigms for solving complex learning problems with structured output spaces. In the local setting, we learn linear classifiers that predict a letter e_i from the previously predicted letters and the original name f. In the global setting, we learn a function W mapping a pair (f,e) into a score W(f,e) ∈ R. The function W is linear in features computed from the pair (f,e). We describe the pertinent feature spaces and present both training and decoding algorithms for the local and global settings.</Paragraph>
<Paragraph position="6"> We perform an experimental evaluation for three language pairs (transliteration from Arabic, Korean, and Russian into English), comparing our methods to a joint probabilistic modeling approach to transliteration that was previously shown to deliver superior performance. We show experimentally that both discriminative methods outperform the probabilistic approach, with global discriminative modeling achieving the best performance for all languages.</Paragraph>
</Section>
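To make the two settings concrete, the following is a minimal sketch of the interfaces they expose; everything in it (the example dictionary entries, the function names, and their signatures) is illustrative rather than taken from the paper.

```python
# D is a transliteration dictionary: name pairs (foreign name f, English name e).
# The entries below are illustrative placeholders.
D = [("wilrieom", "william"), ("vil'iam", "william")]

def predict_next_letter(prefix: str, f: str) -> str:
    """Local setting: a learned classifier would return argmax_e p(e | (prefix, f)),
    i.e. the next English letter given the letters predicted so far and the name f."""
    raise NotImplementedError

def global_score(f: str, e: str) -> float:
    """Global setting: a learned weight vector W would yield the score W . Phi(f, e);
    decoding searches for the string e that maximizes this score."""
    raise NotImplementedError
```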
<Section position="5" start_page="612" end_page="612" type="metho">
<SectionTitle> 2 Preliminaries </SectionTitle>
<Paragraph position="0"> Let E and F be two finite alphabets. We use lowercase Latin letters e, f to denote letters e ∈ E, f ∈ F, and bold letters e ∈ E*, f ∈ F* to denote strings over the corresponding alphabets. The symbols e_i and f_j denote the ith and jth symbols of the strings e and f, respectively. We use e[i,j] to represent the substring e_i ... e_j of e; if j < i, then e[i,j] is the empty string Λ.</Paragraph>
<Paragraph position="1"> A transliteration model is a function mapping a string f to a string e. We seek to learn a transliteration model from a transliteration dictionary D = {(f,e)}. We apply the model in conjunction with a decoding algorithm that produces a string e from a string f.</Paragraph>
</Section>
<Section position="6" start_page="612" end_page="612" type="metho">
<SectionTitle> 3 Local Transliteration Modeling </SectionTitle>
<Paragraph position="0"> In local transliteration modeling, we represent a transliteration model as a sequence of local prediction problems. For each local prediction, we use a history h representing the context in which a single transliteration prediction is made. That is, we predict the letter e_i based on the pair h = (e[1,i-1], f) ∈ H.</Paragraph>
<Paragraph position="1"> Formally, we map H × E into a d-dimensional feature space φ: H × E → R^d, where each φ_k(h,e) (k ∈ {1,...,d}) corresponds to a condition defined in terms of the history h and the currently predicted letter e. In order to model string termination, we augment E with a sentinel symbol $, and we append $ to each e from D.</Paragraph>
<Paragraph position="2"> Given a transliteration dictionary D, we transform the dictionary into a set of |E| binary learning problems. Each learning problem L_e corresponds to predicting a letter e ∈ E. More precisely, for a pair (f[1,m], e[1,n]) ∈ D and i ∈ {1,...,n}, we generate a positive example φ((e[1,i-1], f), e_i) for the problem L_{e_i}; the corresponding feature vectors serve as negative examples for the remaining problems L_e, e ≠ e_i.</Paragraph>
<Paragraph position="3"> Each learning problem is a binary classification problem, and we can use our favorite binary classifier learning algorithm to induce a collection of binary classifiers {c_e : e ∈ E}. From most classifiers we can also obtain an estimate of the conditional probability p(e|h) of a letter e given a history h.</Paragraph>
<Paragraph position="4"> For decoding, in our experiments we use beam search to find the sequence of letters (approximately) maximizing p(e|h).</Paragraph>
<Section position="1" start_page="612" end_page="612" type="sub_section">
<SectionTitle> 3.1 Local Features </SectionTitle>
<Paragraph position="0"> The features used in local transliteration modeling correspond to pairs of substrings of e and f. We limit the length of the substrings as well as their relative location with respect to each other.</Paragraph>
<Paragraph position="1"> * For φ((e[1,i-1], f), e), generate a feature for every pair of substrings (e[i-w,i-1], f[j-v,j]), where 1 ≤ w < W(E), 0 ≤ v < W(F), and |i-j| ≤ d(E,F). Here, W(·) is the upper bound on the length of substrings in the corresponding alphabet, and d(E,F) is the upper bound on the relative distance between the substrings.</Paragraph>
<Paragraph position="2"> * For φ((e[1,i-1], f[1,m]), e), generate the length difference feature φ_len = i - m. In experiments, we discretize φ_len.</Paragraph>
<Paragraph position="3"> The parameters W(E), W(F), and d(E,F) are, in general, language-specific, and we show in the experiments that different values of the parameters are appropriate for different languages.</Paragraph>
</Section>
</Section>
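To make the local feature space concrete, the following is a minimal sketch of how the substring-pair and length-difference features of Section 3.1 could be generated for a single prediction. It is illustrative rather than the authors' implementation: the function name, the dictionary-of-counts representation, and the clamping used for discretization are assumptions, and W_E, W_F, d_EF stand for W(E), W(F), and d(E,F).

```python
from collections import Counter

def local_features(prefix: str, f: str, e: str,
                   W_E: int = 3, W_F: int = 3, d_EF: int = 2) -> Counter:
    """Sketch of phi((e[1,i-1], f), e): features for predicting letter e after prefix.

    Substring-pair features (e[i-w,i-1], f[j-v,j]) with 1 <= w < W_E,
    0 <= v < W_F, |i - j| <= d_EF, plus a discretized length-difference feature.
    """
    feats = Counter()
    i = len(prefix) + 1                      # 1-based position of the predicted letter
    m = len(f)

    for w in range(1, W_E):                  # suffix of the predicted prefix, length w
        if w > len(prefix):
            break
        e_sub = prefix[-w:]
        for j in range(1, m + 1):            # 1-based position in f
            if abs(i - j) > d_EF:
                continue
            for v in range(W_F):             # substring of f ending at j, length v + 1
                if j - v < 1:
                    break
                f_sub = f[j - v - 1:j]
                feats[("sub", e_sub, f_sub, e)] += 1

    # Length-difference feature; clamping to [-3, 3] is an illustrative discretization.
    feats[("len", max(-3, min(3, i - m)), e)] += 1
    return feats
```

A binary classifier for each letter can then be trained on such feature vectors, and beam search over the per-letter probabilities p(e|h) assembles the predicted string letter by letter.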
<Section position="7" start_page="612" end_page="613" type="metho">
<SectionTitle> 4 Global Transliteration Modeling </SectionTitle>
<Paragraph position="0"> In global transliteration modeling, we directly model the agreement function between f and e.</Paragraph>
<Paragraph position="1"> We follow (Collins 2002) and consider the global feature representation Φ: F* × E* → R^d. Each global feature corresponds to a condition on the pair of strings, and the value of the feature is the number of times the condition holds true for a given pair of strings. In particular, for every local feature φ_k there is a global feature Φ_k(f,e) that counts, over the positions i of e, how often the corresponding local condition φ_k((e[1,i-1], f), e_i) holds.</Paragraph>
<Paragraph position="2"> We seek a transliteration model that is linear in the global features. Such a transliteration model is represented by a d-dimensional weight vector W ∈ R^d. Given a string f, model application corresponds to finding a string e maximizing the score W · Φ(f,e). As in the case of local modeling, due to computational constraints, we use beam search for decoding in global transliteration modeling.</Paragraph>
<Paragraph position="3"> (Collins 2002) showed how to use the Voted Perceptron algorithm for learning W, and we use it for learning the global transliteration model. We use beam search for decoding within the Voted Perceptron training as well.</Paragraph>
<Section position="1" start_page="613" end_page="613" type="sub_section">
<SectionTitle> 4.1 Global Features </SectionTitle>
<Paragraph position="0"> The global features used in global transliteration modeling directly correspond to the local features described in Section 3.1.</Paragraph>
<Paragraph position="1"> * For e[1,n] and f[1,m], generate a feature for every pair of substrings (e[i-w,i], f[j-v,j]), where 1 ≤ w < W(E), 0 ≤ v < W(F), and |i-j| ≤ d(E,F).</Paragraph>
<Paragraph position="2"> * For e[1,n] and f[1,m], generate the length difference feature Φ_len = n - m. In experiments, we discretize Φ_len.</Paragraph>
</Section>
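The sketch below (not from the paper) shows one way the global model could be trained and applied. It reuses the hypothetical local_features function from the sketch above, accumulates the global score incrementally during beam search, and substitutes a plain structured-perceptron update for the Voted Perceptron of (Collins 2002), so it illustrates the setup rather than reproducing the authors' algorithm.

```python
import string
from collections import Counter

# Assumptions: a lowercase English alphabet plus the "$" terminator, a fixed
# maximum output length, and the local_features() sketch defined earlier.
E_ALPHABET = list(string.ascii_lowercase) + ["$"]
MAX_LEN = 20

def beam_decode(f, weights, beam_size=5):
    """Approximate argmax over e of W . Phi(f, e), extending candidates letter by letter."""
    beam = [("", Counter(), 0.0)]            # (prefix, accumulated features, score)
    finished = []
    for _ in range(MAX_LEN):
        candidates = []
        for prefix, feats, score in beam:
            for letter in E_ALPHABET:
                step = local_features(prefix, f, letter)
                gain = sum(weights.get(k, 0.0) * v for k, v in step.items())
                cand = (prefix + letter, feats + step, score + gain)
                (finished if letter == "$" else candidates).append(cand)
        if not candidates:
            break
        beam = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_size]
    best = max(finished + beam, key=lambda c: c[2])
    return best[0].rstrip("$"), best[1]

def perceptron_train(D, epochs=10):
    """Simplified (unvoted) perceptron: move W toward gold features, away from predicted ones."""
    weights = {}
    for _ in range(epochs):
        for f, e_gold in D:
            e_pred, feats_pred = beam_decode(f, weights)
            if e_pred == e_gold:
                continue
            feats_gold = Counter()
            for i, letter in enumerate(e_gold + "$"):
                feats_gold += local_features(e_gold[:i], f, letter)
            for k, v in feats_gold.items():
                weights[k] = weights.get(k, 0.0) + v
            for k, v in feats_pred.items():
                weights[k] = weights.get(k, 0.0) - v
    return weights
```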
<Paragraph position="4"> We compare the discriminative approaches to a joint probabilistic approach to transliteration introduced in recent years.</Paragraph>
<Paragraph position="5"> In the joint probabilistic modeling approach, we estimate a probability distribution p(e,f). We also postulate hidden random variables a representing the alignment of e and f. An alignment a of e and f is a sequence a_1, ..., a_L of pairs of substrings, where each pair a_l aligns a substring of e with a substring of f; we allow at most one member of a pair a_l to be an empty string. Given an alignment a, we define the joint probability p(e,f|a) as the product of the probabilities of its substring pairs, and we learn these substring-pair probabilities using a version of the EM algorithm.</Paragraph>
<Paragraph position="6"> In our experiments, we use the Viterbi version of the EM algorithm: starting from random alignments of all string pairs in D, we compute maximum likelihood estimates of the above probabilities, which are then employed to induce the most probable alignments in terms of the new probability estimates. The process is repeated until the probability estimates converge.</Paragraph>
<Paragraph position="7"> During decoding, given a string f, we seek both a string e and an alignment a such that p(e,f|a) is maximized. In our experiments, we use beam search for decoding.</Paragraph>
<Paragraph position="8"> Note that with joint probabilistic modeling, use of a language model p(e) is not strictly necessary. Yet we found experimentally that an adaptive combination of the language model with the joint probabilistic model improves transliteration performance. We thus combine the joint log-likelihood log p(e,f|a) with log p(e): log p(e,f|a) + α log p(e). (3)</Paragraph>
<Paragraph position="9"> We estimate the parameter α on a held-out set by generating, for each f, the set of top K=10 candidates with respect to log p(e,f|a), then using (3) to re-rank the candidates, and picking α to minimize the number of transliteration errors among the re-ranked candidates.</Paragraph>
</Section>
</Paper>