<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1501"> <Title>Learning Formulation and Transformation Rules for Multilingual Named Entities</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Multilingual Named Entity Corpora </SectionTitle> <Paragraph position="0"> The NICT location name corpus, which was developed by the Ministry of Education of Taiwan in 1995, collects 19,385 foreign location names. Each entry consists of three parts: the foreign location name, the Chinese transliteration/translation name, and the country name, e.g., (Victoria Fall, &quot;Wei Duo Li Ya Pu Bu&quot; (wei duo li ya pu bu), South Africa), (Little Rocky Mountains, &quot;Xiao Luo Ji Shan Mo&quot; (xiao luo ji shan mo), USA), etc. The foreign location names are written in the English alphabet. Some location names denoting the same city have more than one form, such as Firenze and Florence for a famous Italian city. The former is the Italian name and the latter is its English name. They correspond to two different Chinese transliterations, i.e., &quot;Fei Leng Cui&quot; (fei leng cui) and &quot;Fo Luo Lun Si&quot; (fo luo lun si), respectively. The pronunciation of the foreign names in the NICT corpus is based on Webster's New Geographic Dictionary. The foreign name itself may be a transliterated name. A Japanese city name is transliterated into the English alphabet, but its corresponding translation is in Kanji (Hanzi in Japanese). It is hard to capture this relationship other than by dictionary lookup, so Japanese location names are excluded from our discussion.</Paragraph> <Paragraph position="1"> We employ the country field to select the translation/transliteration pairs that we deal with in this paper. Table 1 summarizes the statistics of the NICT corpus based on country tags.</Paragraph> <Paragraph position="2"> The CNA personal name and organization corpora are used by news reporters to unify name transliteration/translation in news stories. 
There are 50,586 pairs of foreign personal names and Chinese transliterations/translations in the personal name corpus. Unlike the NICT corpus, it offers no clear cues to identify the nationality of the named people. Thus, we could not automatically exclude Japanese names like &quot;Hayakawa&quot; and the corresponding name &quot;Zao Chuan&quot; (zao chuan) from our discussion. There are 14,658 named organizations in the CNA corpus. Some organization names are tagged with the names of the countries to which they belong, for example, &quot;Aachen Technical University&quot; = Ya Ken Ji Shu Da Xue (ya ken ji shu da xue) (Germany). But not all the organization names have such country tags. Organization names are comparatively longer than the other two types of named entities. Table 2 shows the statistics of the CNA organization name corpus. FL denotes the length of foreign names in words, CL denotes the length of Chinese names in characters, and Count denotes the number of foreign names of the specified length.</Paragraph> </Section> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Rule Mining </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Frequency-Based Approach with a Bilingual Dictionary </SectionTitle> <Paragraph position="0"> We postulate that a transliterated term is usually an unknown word, i.e., not listed in a lexicon, whereas a translated term often appears in a lexicon. Under this postulation, a translated term occurs more often in a corpus, while a transliterated term appears only a few times.</Paragraph> <Paragraph position="1"> A simple frequency-based method computes the frequencies of terms and uses them to distinguish the transliterated and translated parts of a named entity. Because Chinese has a segmentation problem, we start the frequency computation from the foreign name part of a multilingual named entity corpus. The method is sketched as follows. 
(1) Compute the frequency of each word in the foreign name list.</Paragraph> <Paragraph position="2"> (2) Keep those words whose frequency exceeds a threshold and that appear in a common foreign dictionary (e.g., an English dictionary). These words form the candidates of simple keywords.</Paragraph> <Paragraph position="3"> (3) Examine the foreign word list again.</Paragraph> <Paragraph position="4"> Those word strings that are composed of simple keyword candidates are candidates of compound keywords. We determine the compound keyword set with a collocation metric, selecting the most frequently occurring compounds after the well-known elimination of prepositions.</Paragraph> <Paragraph position="5"> (4) Because the experimental corpus is aligned, we can cluster the Chinese name list based on the foreign keywords. For each Chinese name cluster, we try to identify the Chinese keyword sets. Here a bilingual dictionary may be consulted.</Paragraph> <Paragraph position="6"> The above algorithm extracts foreign/Chinese keyword sets from a multilingual named entity corpus. In the meantime, formulation rules for foreign names and their Chinese counterparts are mined. A complete foreign name and a complete Chinese name are mapped into name-keyword combinations.</Paragraph> <Paragraph position="7"> Which method, translation or transliteration, is used is also determined in the process.</Paragraph> <Paragraph position="8"> Take the NICT location name corpus as an example. 
The terms with frequencies greater than 20 include River (He, he), Island (Dao, dao), Lake (Hu, hu), Mountain (Shan, shan), Bay (Wan, wan), Mountain (Feng, feng), Peak (Feng, feng), Islands (Qun Dao, qun dao), Mountains (Shan Mo, shan mo), Cape (Jiao, jiao), City (Cheng, cheng), Range (Ling, ling), Peninsula (Ban Dao, ban dao), Point (Jiao, jiao), Strait (Hai Xia, hai xia), River (Chuan, chuan), Gulf (Wan, wan), Cape (Jia, jia), Pass (Shan Kou, shan kou), Plateau (Gao Yuan, gao yuan), Headland (Jia, jia), Harbor (Gang, gang), Sea (Hai, hai), Promontory (Jia, jia), and Hills (Qiu Ling, qiu ling). On the one hand, a foreign location keyword, e.g., &quot;Mountain&quot;, may correspond to two Chinese location keywords, e.g., &quot;Shan&quot; (shan) and &quot;Feng&quot; (feng). On the other hand, the same Chinese location keyword &quot;Feng&quot; (feng) can be translated into two English location keywords, &quot;Mountain&quot; and &quot;Peak&quot;.</Paragraph> <Paragraph position="9"> Similarly, suffixes and prefixes of organization names can be extracted from the CNA organization name corpus. Some high-frequency keywords are shown as follows.</Paragraph> <Paragraph position="10"> (1) Suffix: Party (Dang, dang), Association (Xie Hui, xie hui), University (Da Xue, da xue), Co. (Gong Si, gong si), Committee (Wei Yuan Hui, wei yuan hui), Company (Gong Si, gong si), Bank (Yin Xing, yin hang), etc. (2) Prefix: International (Guo Ji, guo ji), World (Shi Jie, shi jie), American (Mei Guo, mei guo), National (Quan Guo, quan guo), Japan (Ri Ben, ri ben), National (Guo Jia, guo jia), Asian (Ya Zhou, ya zhou), etc.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Keyword Extraction without a Bilingual Dictionary </SectionTitle> <Paragraph position="0"> At step (4) of the algorithm in Section 3.1, a bilingual dictionary is required. 
Because abbreviation is commonly adopted in translation, a dictionary-based approach can hardly capture this phenomenon. The named organization &quot;World Taiwanese Association&quot;, which is translated into &quot;Shi Tai Hui&quot; (shi tai hui), is a typical example. The term &quot;World&quot; is translated into the abbreviated term &quot;Shi&quot; (shi) rather than the complete term &quot;Shi Jie&quot; (shi jie). Here another approach, which needs no dictionary, is proposed. Suppose there are M pairs of (foreign name, Chinese name) in a multilingual named entity corpus. The j-th pair, 1 ≤ j ≤ M, is denoted by (E_j, C_j), where E_j</Paragraph> <Paragraph position="2"> is a foreign named entity, and C_j is a Chinese named entity. Then some Chinese segment c of C_j should be associated with some foreign segment e of E_j. Consider the following examples.</Paragraph> <Paragraph position="3"> (s6) Aletschhorn Mountain = A Li Qi He En Shan (s7) Catalan Mountain = Qia Tai Lan Shan (s8) Cook Strait = Ke Ke Hai Xia (s9) Dover, Strait of = Duo Fo Hai Xia. From these examples, we align &quot;Shan&quot; (shan) and &quot;Hai Xia&quot; (hai xia) to Mountain and Strait, respectively.</Paragraph> <Paragraph position="4"> We further decompose the named entities. If a named entity E</Paragraph> <Paragraph position="6"> }. We then group the pairs collected from the multilingual named entity list and count the frequency of each pair. Pairs with higher frequency denote significant segment pairs. In the above examples, both pairs {Mountain, &quot;Shan&quot; (shan)} and {Strait, &quot;Hai Xia&quot; (hai xia)} appear twice, while the other pairs appear only once.</Paragraph> <Paragraph position="7"> All the pairs {e, c} whose frequency > 2 are kept. Two issues have to be addressed. The first is that redundancy, which may exist in the pairs of segments, should be eliminated carefully. If a pair</Paragraph> <Paragraph position="9"> of t×(t+1)/2 substrings (1 ≤ u ≤ v ≤ 
t) is at least k.</Paragraph> <Paragraph position="10"> The second is that e may be translated into more than one synonym, and these synonyms may share the same prefix, suffix, or infix. In examples (s10) and (s11), &quot;Association&quot; is translated into &quot;Xie Hui&quot; (xie hui) and &quot;Lian Yi Hui&quot; (lian yi hui), where &quot;Hui&quot; (hui) is a common suffix of these two translation equivalents, so that its frequency is higher than that of either translation equivalent.</Paragraph> <Paragraph position="11"> (s10) World Trade Association = Shi Jie Mao Yi Xie Hui (s11) North Europe Chinese Association = Bei Ou Hua Ren Lian Yi Hui. These two issues may be mixed together, making the problem more challenging.</Paragraph> <Paragraph position="12"> A metric to deal with the above issues is proposed. The concept is borrowed from the tf×idf scheme in information retrieval to measure the alignment between each foreign segment and its possible Chinese translation segments. Assume there are N foreign segments. Term frequency (tf) of a</Paragraph> <Paragraph position="14"> in e denotes the number of occurrences of c</Paragraph> <Paragraph position="16"> is translated to. We prefer the Chinese translation segment that occurs frequently in a specific foreign segment, but rarely in the remaining foreign segments. Besides, we also prefer longer Chinese segments, so that the length of a Chinese segment, i.e., |c</Paragraph> <Paragraph position="18"> In this way, we can produce a ranked list of pairs of (foreign segment, Chinese segment), which form multilingual keyword pairs.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Extraction of Transformation Rules </SectionTitle> <Paragraph position="0"> We apply the keyword pairs extracted in the last section to the original named entity list. In (s6)-(s9), (mountain, Shan (shan)) and (strait, Hai Xia (hai xia)) are significant keyword pairs. 
We replace the</Paragraph> <Paragraph position="2"> with patterns g and d, respectively, and get the following rules.</Paragraph> <Paragraph position="4"> (s6') and (s7') can be grouped into one rule. As a result, a set of transformation rules can be formulated. These examples show that a Chinese location name keyword tends to be located in the rightmost position, and the remaining part is a transliterated name. By contrast, a foreign location name keyword tends to be either located in the rightmost position or permuted with some prepositions, a comma, and the transliterated part.</Paragraph> </Section> <Section position="4" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.4 Extraction of Keywords at a Distance </SectionTitle> <Paragraph position="0"> The algorithm proposed in Section 3.2 can deal with single keywords and connected compound keywords. Now we extend it to keywords at a distance. Consider examples (s12)-(s15) first.</Paragraph> <Paragraph position="1"> (s12) American Podiatric Medical Association = Mei Guo Zu Bing Yi Liao Xue Hui (s13) American Public Health Association = Mei Guo Gong Gong Wei Sheng Xue Hui (s14) American Society for Industrial Security = Mei Guo Gong Ye An Quan Xie Hui (s15) American Society of Newspaper Editors = Mei Guo Bao Zhi Bian Ji Ren Xie Hui. Examples (s12) and (s13) show that an English compound keyword is separated, and so is its Chinese counterpart. In contrast, the English compound keyword is connected in (s14) and (s15), but the corresponding Chinese translation is separated. This phenomenon appears quite often in the translation of organization names.</Paragraph> <Paragraph position="2"> We introduce a symbol [?] to cope with the distance issue. The original algorithm is modified as follows. A candidate segment c_{p,q} is defined as a string that begins with s_p and ends with s</Paragraph> <Paragraph position="4"> instances, respectively. 
For example, the following shows some additional instances for The scoring method, i.e., formulas (1)-(4), is still applicable to the new algorithm. Nevertheless, the complexity is different. The complexity of the original algorithm is O(m n ), but the complexity of the algorithm here is O(2 m n ), where m is the word count of a foreign named entity and n is the character count of a Chinese named entity. The mining procedure is performed only once, and the mined rules are employed in an application without being recomputed. Thus, running time is not the major concern of this paper. Besides, N is bounded by a reasonably small number because a named entity is always much shorter than a sentence. Table 2 shows that 93.88% of the foreign names in the CNA organization name corpus consist of fewer than 7 words.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> The algorithm in Section 3.2 was run on the NICT location name corpus and the CNA personal name and organization corpora. With this algorithm, we can produce a ranked list of pairs of (foreign segment, Chinese segment), which form multilingual keyword pairs. Individual foreign segments and Chinese segments are regarded as formulation rules for foreign languages and Chinese, respectively. When the two segments are considered together, they form a transformation rule. Table 3 summarizes the results using the frequency-based approach without a dictionary. For named locations, there are 18,922 records, of which only 5,714 consist of more than one foreign word. In other words, 13,208 named locations are single words, and they are unique, so we cannot extract keywords from them. In total, 122 keyword pairs are identified. We classify these keyword pairs into the following types: (hei), Blue = Lan (lan), etc.) 
(c) the specificity of a place or area, such as Crystal = Jie Jing (jie jing), Diamond = Zuan Shi (zuan shi), etc.</Paragraph> <Paragraph position="1"> (2) Phoneme transliteration keywords. Some morphemes are transliterated, such as el = La (la), Dera = De La (de la), Monte = Meng Te (meng te), Los = Luo Si (luo si), Le = Le (le), and so on. Besides, some common transliterated names are also regarded as keywords, e.g., Elizabeth = Yi Li Sha Bai (yi li sha bai), Edward = Ai De Hua (ai de hua), etc. In total, 39 terms belong to this type, accounting for 31.97%.</Paragraph> <Paragraph position="2"> (3) Some keywords in type (1) are transliterated. For example, Bay = Bei (bei), Beach = Bi Qi (bi qi), mountain = Meng Tan (meng tan), Little = Li Te (li te), etc. In total, 14 keywords (11.48%) are extracted.</Paragraph> <Paragraph position="3"> In total, 230 transformation rules are mined from the NICT location corpus. On average, a keyword pair corresponds to 1.89 transformation rules. Consider the keyword pair mountain = Shan (shan) as an example. The four transformation rules shown below are learned, where a and b denote keywords in the foreign language and in Chinese, respectively; d is the Chinese transliteration of a foreign fragment g; and the number enclosed in parentheses denotes how many times the rule is applied. (1) ga = db (234); (2) g , a = db (45); (3) g , ag = db (1); (4) gag = db (1). When we apply the 230 transformation rules back to the 5,714 named locations, we can tell which part is transliterated and which part is translated in 4,262 named locations. This confirms our postulation that a named location is composed of two parts, one translated and the other transliterated.</Paragraph> <Paragraph position="4"> By comparison, there are 50,586 personal names in the CNA personal name corpus, but only 100 of them are composed of more than one word. Only a few keywords are extracted. 
They are listed below.</Paragraph> <Paragraph position="5"> De = Dai (dai), La = La (la), De La = Dai La (dai la), Van Der = Fan De (fan de), Du = Du (du), David = Da Wei (da wei), Khan = Han (han), Del = Dai (dai), Le = Le (le), Van Den = Fan Deng (fan deng), Di = Di (di). This shows that personal names tend to be transliterated, and that the CNA personal name corpus is suitable for training the similarity scores among phonetic characters (Lin and Chen, 2002).</Paragraph> <Paragraph position="6"> Finally, we consider the named organizations. There are 14,658 records in the CNA organization corpus. In total, 12,885 organization names are composed of more than one word. This percentage, 87.90%, is the highest among the three corpora. Besides, 5,229 keyword pairs are extracted. Most of the keyword pairs are translated by meaning. This set is also the largest among the three corpora. Thus, the sets of keyword pairs are too small for personal names and too large for organization names to find suitable transformation rules.</Paragraph> <Paragraph position="7"> Although the original idea of our algorithm is language-independent, it should be modified slightly for some specific languages. The following takes German as an example. German words have cases and genders. Most German words are compounds. Consider examples (s16)-(s19). (s16) Neue Osnabruecker = Xin Ao Si Na Bu Lu Bao (s17) Neues Deutschland = Xin De Guo (s18) Bundesbahn = Lian Bang Tie Lu Ju (s19) Bundesbank = Lian Bang Yin Xing. The first two examples show that the German adjective Neu (New) takes different suffixes such as &quot;-e&quot; and &quot;-es&quot; according to the case and gender of the noun. 
The last two examples suggest that morphological analysis, which decompounds the words into meaningful segments, is necessary before applying our algorithm.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Application on CLIR </SectionTitle> <Paragraph position="0"> Cross-language information retrieval (CLIR) facilitates using queries in one language to access documents in another. Because named entities are key components of a document, they are usually the targets that users are interested in. Figure 1 shows an application of the extracted formulation rules and transformation rules to Chinese-Foreign CLIR.</Paragraph> <Paragraph position="1"> For each document in the foreign collection, named entities are recognized and classified by using formulation rules. They form important indices for the related documents. When a Chinese query is issued, the system extracts the possible Chinese named entities according to the Chinese formulation rules. If keywords are specified in a query, we know the structure and the type of the named entity. The lexical structure tells us which part is translated and which part is transliterated.</Paragraph> <Paragraph position="2"> The backward transliteration method proposed by Lin and Chen (2000, 2002) is followed to select the most similar English named entity and the related documents at the same time. In Lin and Chen's approach, both the Chinese name and the English candidates are transformed into a canonical form in terms of the International Phonetic Alphabet.</Paragraph> <Paragraph position="3"> Similarity computation between the Chinese query term and the English candidates is done at the phoneme level.</Paragraph> <Paragraph position="4"> That is an expensive operation. Hopefully, the type of the Chinese named entity will help to narrow down the number of candidates.</Paragraph> </Section> </Paper>