<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3248"> <Title>A New Approach for English-Chinese Named Entity Alignment</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 NE Alignment with a Maximum Entropy Model </SectionTitle> <Paragraph position="0"> Without relying on syntactic knowledge from either the English side or the Chinese side, we find that there are several valuable features that can be used for Named Entity alignment. Considering the advantage of the maximum entropy model (Berger et al., 1996) in integrating different kinds of features, we use this framework to handle our problem.</Paragraph> <Paragraph position="1"> Suppose the source English NE $ne_e$ consists of $n$ English words and the candidate Chinese NE $ne_c$ consists of $m$ Chinese characters. Suppose also that we have $M$ feature functions $h_m(ne_c, ne_e)$, $m = 1, \ldots, M$. For each feature function, we have a model parameter $\lambda_m$, $m = 1, \ldots, M$. The alignment probability can be defined as follows (Och and Ney, 2002):

$$Pr(ne_c \mid ne_e) = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(ne_c, ne_e)\right]}{\sum_{ne'_c} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(ne'_c, ne_e)\right]} \quad (3.1)$$

The decision rule to choose the most probable aligned target NE of the English NE is (Och and Ney, 2002):

$$\widehat{ne}_c = \operatorname*{arg\,max}_{ne_c} \left\{ \sum_{m=1}^{M} \lambda_m h_m(ne_c, ne_e) \right\} \quad (3.2)$$

In our approach, considering the characteristics of NE translation, we adopt four features: a translation score, a transliteration score, a co-occurrence score of the source and target NEs, and a distortion score for distinguishing identical NEs in the same sentence. Next, we discuss these four features in detail.</Paragraph> <Section position="1" start_page="1" end_page="11" type="sub_section"> <SectionTitle> 3.1 Feature Functions </SectionTitle> <Paragraph position="0"> It is important to consider the translation probability between the words in the English NE and the characters in the Chinese NE. Since we process the Chinese sentence without segmentation, "word" here refers to a single Chinese character.</Paragraph> <Paragraph position="1"> The translation score represents how close an NE pair is in terms of translation probabilities. Supposing the source English NE $ne_e$ consists of $n$ English words, $ne_e = \{e_1, e_2, \ldots, e_n\}$, and the candidate Chinese NE $ne_c$ is composed of $m$ Chinese characters, $ne_c = \{c_1, c_2, \ldots, c_m\}$, we can compute the translation score of these two bilingual NEs from the translation probabilities between $e_i$ and $c_j$:

$$S_{trans}(ne_e, ne_c) = \prod_{j=1}^{m} \sum_{i=1}^{n} p(c_j \mid e_i) \quad (3.3)$$

Given a parallel corpus aligned at the sentence level, we can obtain the translation probability $p(c \mid e)$ via word alignments with IBM Model 1 (Brown et al., 1993). Without word segmentation, we have to evaluate every possible candidate to determine the most probable alignment, which makes the search space very large. Therefore, we prune the search space: if there is a score jump between two adjacent characters, the candidate is discarded. The scores between the candidate Chinese NEs and the source English NE, calculated with this formula, serve as the value of this feature.</Paragraph>
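As a concrete illustration of Formula (3.3) and the adjacent-character pruning described above, the following is a minimal sketch, not the paper's implementation; the table layout, the smoothing floor, and the max_ratio threshold are assumptions introduced here.

    # Minimal sketch of the translation-score feature (Formula 3.3).
    # Assumption: `t_table` maps (chinese_char, english_word) -> p(c|e),
    # as learned with IBM Model 1; unseen pairs get a small floor value.

    FLOOR = 1e-9  # assumed smoothing floor for unseen pairs

    def char_score(c, ne_e, t_table):
        """Summed translation probability of one Chinese character
        given all words of the English NE."""
        return sum(t_table.get((c, e), FLOOR) for e in ne_e)

    def translation_score(ne_e, ne_c, t_table):
        """Product over candidate characters of their summed
        translation probabilities (Formula 3.3)."""
        score = 1.0
        for c in ne_c:
            score *= char_score(c, ne_e, t_table)
        return score

    def has_score_jump(ne_e, ne_c, t_table, max_ratio=100.0):
        """Pruning heuristic: discard the candidate if the per-character
        scores of two adjacent characters differ by a large factor.
        `max_ratio` is an assumed, tunable threshold."""
        scores = [char_score(c, ne_e, t_table) for c in ne_c]
        for prev, cur in zip(scores, scores[1:]):
            if max(prev, cur) / max(min(prev, cur), FLOOR) > max_ratio:
                return True
        return False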
<Paragraph position="2"> Although in theory translation scores can establish the relations within correct NE alignments, in practice this is not always the case, owing to the characteristics of the corpus. This is especially obvious when the data are sparse. For example, most person names among the Named Entities are sparsely distributed in the corpus and are not repeated regularly. Moreover, some English NEs are translated via transliteration (Lee and Chang, 2003; Al-Onaizan and Knight, 2002; Knight and Graehl, 1997) rather than semantic translation. Therefore, it is fairly important to build transliteration models.</Paragraph> <Paragraph position="3"> Given an English Named Entity $e$, the transliteration probability of a Chinese NE $c$ can be described with Formula (3.4) (for simplicity of notation, we here use $e$ and $c$ to represent the English NE and the Chinese NE instead of $ne_e$ and $ne_c$):

$$p(c \mid e) = \frac{p(e \mid c)\, p(c)}{p(e)} \quad (3.4)$$

Since there are more than 6,000 commonly used Chinese characters, we would need a very large training corpus to build the mapping directly between English words and Chinese characters. We adopt a romanization system, Chinese PinYin, to ease the transformation. Each Chinese character corresponds to a Chinese PinYin string $r$, and the probability from a Chinese character to its PinYin string is $p(r \mid c) \approx 1$, except for polyphonous characters. Thus we have:

$$p(c \mid e) \approx p(r \mid e) = \frac{p(e \mid r)\, p(r)}{p(e)} \quad (3.5)$$

</Paragraph> <Paragraph position="4"> Our problem is: given the English NE and the candidate Chinese NEs, find the most probable alignment, rather than the most probable Chinese translation of the English NE. Therefore, unlike previous work (Lee and Chang, 2003; Huang et al., 2003) on English-Chinese transliteration models, we transform each candidate Chinese NE to Chinese PinYin strings and directly train a PinYin-based language model, with a separate English-Chinese name list consisting of 1,258 name pairs, to decode the most probable PinYin string from the English NE.</Paragraph> <Paragraph position="5"> To find the most probable PinYin string for the English NE, we rewrite Formula (3.5) as the following:

$$\hat{r} = \operatorname*{arg\,max}_{r} \; p(e \mid r)\, p(r) \quad (3.6)$$

where $r$ is composed of Chinese PinYin substrings.</Paragraph> <Paragraph position="6"> For example, take the English NE "Richard" and its candidate Chinese NE "Li Cha De". Since both the channel model and the language model are PinYin based, the result of Viterbi decoding is from "Ri char d" to "Li Cha De". We transform the characters of the candidate Chinese NE to the PinYin string "Li Cha De" and then compare the similarity based on the PinYin strings instead of the Chinese characters directly. This is because, when transliterating English NEs into Chinese, the choice of character used to simulate the pronunciation is very flexible, whereas the PinYin string is relatively fixed.</Paragraph> <Paragraph position="7"> For every English word there exist several ways to partition it into syllables, so we adopt a dynamic programming algorithm to decode the English word into a Chinese PinYin sequence.</Paragraph> <Paragraph position="8"> Based on the transliteration string of the English NE and the PinYin string of the original candidate Chinese NE, we can calculate their similarity with the XDice coefficient (Brew and McKelvie, 1996). This is a variant of the Dice coefficient which allows "extended bigrams". An extended bigram (xbig) is formed by deleting the middle letter from any three-letter substring of the word, in addition to the original bigrams. Suppose the transliteration string of the English NE and the PinYin string of the candidate Chinese NE are $s_1$ and $s_2$; their similarity is then

$$XDice(s_1, s_2) = \frac{2 \times |xbig(s_1) \cap xbig(s_2)|}{|xbig(s_1)| + |xbig(s_2)|}$$

</Paragraph>
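A minimal sketch of the XDice coefficient as defined above, assuming the standard set-based formulation of Brew and McKelvie (1996); details such as case handling are assumptions here, not taken from the paper:

    # Minimal sketch of the XDice coefficient: standard bigrams plus
    # "extended bigrams" formed by deleting the middle letter of every
    # three-letter substring.

    def xbig(s):
        """Set of bigrams and extended bigrams of a string."""
        bigrams = {s[i:i + 2] for i in range(len(s) - 1)}
        extended = {s[i] + s[i + 2] for i in range(len(s) - 2)}
        return bigrams | extended

    def xdice(s1, s2):
        """Dice coefficient over the extended-bigram sets."""
        x1, x2 = xbig(s1.lower()), xbig(s2.lower())
        if not x1 or not x2:
            return 0.0
        return 2.0 * len(x1 & x2) / (len(x1) + len(x2))

    # e.g. xdice("Guba", "Cuba") = 0.6: the bigrams "ub", "ba" and the
    # extended bigram "ua" are shared, while unrelated strings score
    # near zero.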
<Paragraph position="9"> However, the transliteration framework above applies only to foreign names. For Chinese person name translation, the surface English strings are exactly the PinYin strings of the Chinese person names. To deal with these two situations, let $e_{sur}$ denote the surface English string and $r_e$ the decoded transliteration (PinYin) string of the English NE; the final transliteration score is defined by taking the maximum of the two XDice coefficients:

$$S_{translit}(ne_e, ne_c) = \max\{XDice(r_e, r_c),\; XDice(e_{sur}, r_c)\}$$

where $r_c$ is the PinYin string of the candidate Chinese NE. This formula does not differentiate foreign person names from Chinese person names; both foreign person names' transliteration strings and Chinese person names' PinYin strings are handled appropriately. Moreover, since the surface English string and the PinYin string share the same character set, our approach can also work as a fallback if the transliteration decoding fails.</Paragraph> <Paragraph position="10"> For example, for the English name "Cuba", the aligned Chinese NE should be "Gu Ba". If the transliteration decoding fails, its PinYin string, "Guba", still has a very strong relation to the surface string "Cuba" via the XDice coefficient. This makes the system more robust.</Paragraph> <Paragraph position="11"> Another feature comes from the co-occurrence of the source and target NEs in the whole corpus. If two NEs co-occur very often, there is a good chance that they align to each other. We therefore calculate a co-occurrence score for the source English NE and the candidate Chinese NE based on how often they co-occur across the corpus; the knowledge acquired from the whole corpus is an extra and valuable feature for NE alignment.</Paragraph> <Paragraph position="12"> When translating NEs across languages, we notice that the difference between their positions is also a good indication of their relation, and it is indispensable when there are identical candidates in the target sentence. The bigger the difference, the less probable it is that they are translations of each other. Therefore, we define the distortion score between the source English NE and the candidate Chinese NE as another feature. Suppose the start position of the English NE is $i$ and the length of the English sentence is $m$; the relative position of the English NE is then $rp_e = i/m$. Likewise, with start position $j$ and sentence length $n$ on the Chinese side, $rp_c = j/n$, and the distortion score is

$$S_{dist}(ne_e, ne_c) = 1 - ABS(rp_e - rp_c)$$

where ABS means the absolute value. If there are multiple identical candidate Chinese NEs at different positions in the target sentence, the one with the largest distortion score wins.</Paragraph> </Section>
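A minimal sketch of the distortion feature as reconstructed above; the linear form and the variable names are assumptions, not the paper's code:

    # Minimal sketch of the distortion score: compare the relative start
    # positions of the two NEs in their respective sentences.

    def distortion_score(i, m, j, n):
        """i: start index of the English NE, m: English sentence length,
        j: start index of the Chinese candidate, n: Chinese sentence
        length. Returns a value in [0, 1]; identical relative positions
        score 1."""
        return 1.0 - abs(i / m - j / n)

    # With identical candidates at several target positions, keep the
    # best-placed one, e.g.:
    #   best = max(positions, key=lambda j: distortion_score(i, m, j, n))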
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2 Bootstrapping with the MaxEnt Model </SectionTitle> <Paragraph position="0"> To apply the maximum entropy model to NE alignment, we proceed in two steps: selecting the NE candidates and training the maximum entropy model parameters.</Paragraph> <Paragraph position="1"> To get an NE alignment with our maximum entropy model, we first use NLPWIN (Heidorn, 2000) to identify Named Entities on the English side. For each word in a recognized NE, we find all the possible translation characters in Chinese through the translation table acquired from IBM Model 1, and all the selected characters serve as "seed" data. With an open-ended window around each seed, all the character sequences located within the window are considered possible candidates for NE alignment; their lengths range from 1 to the empirically determined window length. During candidate selection, the pruning strategy discussed above is applied to reduce the search space.</Paragraph> <Paragraph position="2"> For example, in Figure 1, if "China" has a translation probability over the threshold value only with "Zhong", the two seed characters are located at indexes 0 and 4. Supposing the window length to be 3, all the candidates around the seed data, including "Zhong Guo", with lengths ranging from 1 to 3, are selected.</Paragraph> <Paragraph position="3"> With the four feature functions defined in Section 3.1, for each identified English NE we calculate the feature scores of all the selected Chinese NE candidates.</Paragraph> <Paragraph position="4"> To obtain the most probable aligned Chinese NE, we use the publicly available package YASMET to conduct parameter training and to re-rank all the NE candidates. YASMET requires supervised learning for the training of the maximum entropy model; however, it is not easy to acquire a large annotated training set, so bootstrapping is used to help the process. Figure 2 gives the whole procedure for parameter training.</Paragraph> </Section> </Section> </Paper>