<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1010"> <Title>Named Entity Transliteration with Comparable Corpora</Title> <Section position="5" start_page="73" end_page="75" type="metho"> <SectionTitle> 3 Chinese Transliteration with </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="73" end_page="74" type="sub_section"> <SectionTitle> Comparable Corpora </SectionTitle> <Paragraph position="0"> We assume that we have comparable corpora, consisting of newspaper articles in English and Chinese from the same day, or almost the same day. In our experiments we use English and Chinese stories from the Xinhua News Agency covering about 6 months of 2001. We assume that we have identified names for persons and locations (two types that have a strong tendency to be transliterated wholly or mostly phonetically) in the English text; in this work we use the named-entity recognizer described in (Li et al., 2004), which is based on the SNoW machine learning toolkit (Carlson et al., 1999).</Paragraph> <Paragraph position="1"> To perform the transliteration task, we propose the following general three-step approach: 1. Given an English name, identify candidate Chinese character n-grams as possible transliterations.</Paragraph> <Paragraph position="2"> 2. Score each candidate based on how likely the candidate is to be a transliteration of the English name. We propose two different scoring methods. The first involves phonetic scoring, and the second uses the frequency profile of the candidate pair over time. We will show that each of these approaches works quite well, but that combining them achieves even better results.</Paragraph> <Paragraph position="3"> 3. Propagate scores of all the candidate transliteration pairs globally based on their co-occurrences in document pairs in the comparable corpora.</Paragraph> <Paragraph position="4"> The intuition behind the third step is the following. 
Suppose several high-confidence name transliteration pairs occur in a pair of English and Chinese documents. Intuitively, this would increase our confidence in the other plausible transliteration pairs in the same document pair. We thus propose a score propagation method to allow these high-confidence pairs to propagate some of their scores to other co-occurring transliteration pairs. As we will show later, such a propagation strategy generally improves transliteration accuracy further; in particular, it improves on the already high performance obtained by combining the two scoring methods.</Paragraph> </Section> <Section position="2" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.1 Candidate Selection </SectionTitle> <Paragraph position="0"> The English named entity candidate selection process was already described above. Candidate Chinese transliterations are generated by consulting a list of characters that are frequently used for transliterating foreign names. As discussed elsewhere (Sproat et al., 1996), a subset of a few hundred characters (out of several thousand) tends to be used overwhelmingly for transliterating foreign names into Chinese. We use a list of 495 such characters, derived from various online dictionaries. A sequence of three or more characters from the list is taken as a possible name. If the character &quot;·&quot; occurs, which is frequently used to represent the space between parts of an English name, then at least one character to the left and right of this character will be collected, even if the character in question is not in the list of &quot;foreign&quot; characters. Armed with the English and Chinese candidate lists, we then consider the pairing of every English candidate with every Chinese candidate. 
Obviously it would be impractical to do this for all of the candidates generated for, say, an entire year, so we consider as plausible pairings only those candidates that occur within a day of each other in the two corpora.</Paragraph> </Section> <Section position="3" start_page="74" end_page="74" type="sub_section"> <SectionTitle> 3.2 Candidate Scoring based on Pronunciation </SectionTitle> <Paragraph position="0"> We adopt a source-channel model for scoring English-Chinese transliteration pairs. In general, we seek to estimate $P(e|c)$, where $e$ is a word in Roman script, and $c$ is a word in Chinese script. Since Chinese transliteration is mostly based on pronunciation, we estimate $P(e'|c')$, where $e'$ is the pronunciation of $e$ and $c'$ is the pronunciation of $c$. Again following standard practice, we decompose the estimate of $P(e'|c')$ as $P(e'|c') = \prod_i P(e'_i|c'_i)$. Here, $e'_i$ is the $i$th subsequence of the English phone string, and $c'_i$ is the $i$th subsequence of the Chinese phone string. Since Chinese transliteration attempts to match the syllable-sized characters to equivalent-sounding spans of the English language, we fix the $c'_i$ to be syllables, and let the $e'_i$ range over all possible subsequences of the English phone string. For training data we have a small list of 721 names in Roman script and their Chinese equivalents (see footnote 3). Pronunciations for English words are obtained using the Festival text-to-speech system (Taylor et al., 1998); for Chinese, we use the standard pinyin transliteration of the characters. English-Chinese pairs in our training dictionary were aligned using the alignment algorithm from (Kruskal, 1999), together with a hand-derived set of 21 rules of thumb: for example, we have rules that encode the fact that Chinese /l/ can correspond to English /r/, /n/ or /er/; and that Chinese /w/ may be used to represent /v/. 
Given that there are over 400 syllables in Mandarin (not counting tone) and each of these syllables can match a large number of potential English phone spans, this is clearly not enough training data to cover all the parameters, and so we use Good-Turing estimation to estimate probabilities for unseen correspondences. Since we would like to filter implausible transliteration pairs, we are less lenient than standard estimation techniques in that we are willing to assign zero probability to some correspondences. Thus we set a hard rule that for an English phone span to correspond to a Chinese syllable, the initial phone of the English span must have been seen in the training data as corresponding to the initial of the Chinese syllable some minimum number of times. For consonant-initial syllables we set the minimum to 4. We omit further details of our estimation technique for lack of space. This phonetic correspondence model can then be used to score putative transliteration pairs.</Paragraph> </Section> <Section position="4" start_page="74" end_page="75" type="sub_section"> <SectionTitle> 3.3 Candidate Scoring based on Frequency Correlation </SectionTitle> <Paragraph position="0"> Names of the same entity that occur in different languages often have correlated frequency patterns due to common triggers such as a major event.</Paragraph> <Paragraph position="1"> Thus if we have comparable news articles over a sufficiently long time period, it is possible to exploit such correlations to learn the associations of names in different languages. The idea of exploiting frequency correlation has been well studied.</Paragraph> <Paragraph position="2"> (See the previous work section.) We adopt the method proposed in (Tao and Zhai, 2005), which works as follows: We pool all documents from a single day to form a large pseudo-document. Then, for each transliteration candidate (both Chinese and English), we compute its frequency in each of those pseudo-documents and obtain a raw frequency vector. We further normalize the raw frequency vector so that it becomes a frequency distribution over all the time points (days). To compute the similarity between two distribution vectors, (Tao and Zhai, 2005) used the Pearson correlation coefficient; here we also considered two other commonly used measures, cosine similarity (Salton and McGill, 1983) and Jensen-Shannon divergence (Lin, 1991), though our results show that the Pearson correlation coefficient performs better than either of them.</Paragraph> <Paragraph position="3"> Footnote 3: The LDC provides a much larger list of transliterated Chinese-English names, but we did not use this here for two reasons. First, we have found it to be quite noisy. Second, we were interested in seeing how well one could do with a limited resource of just a few hundred names, which is a more realistic scenario for languages that have fewer resources than English and Chinese.</Paragraph> </Section> <Section position="5" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 3.4 Score Propagation </SectionTitle> <Paragraph position="0"> In both scoring methods described above, the scoring of each candidate transliteration pair is independent of the others. As we have noted, document pairs that contain many plausible transliteration pairs should be viewed as more plausible document pairs; at the same time, in such a situation we should also trust the putative transliteration pairs more. 
Thus these document pairs and transliteration pairs mutually &quot;reinforce&quot; each other, and this can be exploited to further optimize our transliteration scores by allowing transliteration pairs to propagate their scores to each other according to their co-occurrence strengths.</Paragraph> <Paragraph position="1"> Formally, suppose the current generation of transliteration scores is $(e_i, c_i, w_i)$, $i = 1, \ldots, n$, where $(e_i, c_i)$ is a distinct pair of English and Chinese names. Note that although for any $i \neq j$ we have $(e_i, c_i) \neq (e_j, c_j)$, it is possible that $e_i = e_j$ or $c_i = c_j$ for some $i \neq j$. $w_i$ is the transliteration score of $(e_i, c_i)$.</Paragraph> <Paragraph position="2"> These pairs, along with their co-occurrence relation computed from our comparable corpora, can be formally represented by a graph as shown in Figure 2. In such a graph, a node represents $(e_i, c_i, w_i)$. An edge between $(e_i, c_i, w_i)$ and $(e_j, c_j, w_j)$ is constructed iff $(e_i, c_i)$ and $(e_j, c_j)$ co-occur in a certain document pair $(E_t, C_t)$, i.e. there exists a document pair $(E_t, C_t)$ such that $e_i, e_j \in E_t$ and $c_i, c_j \in C_t$.</Paragraph> <Paragraph position="3"> Given a node $(e_i, c_i, w_i)$, we refer to all its directly connected nodes as its &quot;neighbors&quot;. The documents do not appear explicitly in the graph, but they implicitly affect the graph's topology and the weight of each edge. Our idea of score propagation can now be formulated as the following recursive equation for updating the scores of all the transliteration pairs:</Paragraph> <Paragraph position="5"> [Figure 2: graph of transliteration pairs and co-occurrence relations.]</Paragraph> <Paragraph position="6"> $$w_i^{(k)} = \alpha \, w_i^{(k-1)} + (1 - \alpha) \sum_{j \neq i} w_j^{(k-1)} \, P(j|i)$$</Paragraph> <Paragraph position="8"> where $w_i^{(k)}$ is the new score of the pair $(e_i, c_i)$ after an iteration, while $w_i^{(k-1)}$ is its old score before updating; $\alpha \in [0,1]$ is a parameter to control the overall amount of propagation (when $\alpha = 1$, no propagation occurs); $P(j|i)$ is the conditional probability of propagating a score from node $(e_j, c_j, w_j)$ to node $(e_i, c_i, w_i)$.</Paragraph> <Paragraph position="9"> We estimate $P(j|i)$ in two different ways: 1) The number of co-occurrences in the whole collection (denoted CO): $P(j|i) = \frac{C(i,j)}{\sum_{j'} C(i,j')}$, where $C(i,j)$ is the co-occurrence count of $(e_i, c_i)$ and $(e_j, c_j)$; 2) A mutual information-based method (denoted MI): $P(j|i) = \frac{\mathrm{MI}(i,j)}{\sum_{j'} \mathrm{MI}(i,j')}$, where $\mathrm{MI}(i,j)$ is the mutual information of $(e_i, c_i)$ and $(e_j, c_j)$. As we will show, the CO method works better. Note that the transition probabilities between indirect neighbors are always 0. Thus propagation only happens between direct neighbors. This formulation is very similar to PageRank, a link-based ranking algorithm for Web retrieval (Brin and Page, 1998). However, our motivation is propagating scores to exploit co-occurrences, so we do not necessarily want the equation to converge. Indeed, our results show that although the initial iterations always help improve accuracy, too many iterations actually decrease performance.</Paragraph> </Section> </Section> </Paper>