File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/p04-1024_metho.xml
Size: 12,594 bytes
Last Modified: 2025-10-06 14:09:00
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1024"> <Title>Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Language Identification of Names </SectionTitle> <Paragraph position="0"> Given a name in English for which we do not have a translation in a bilingual English-Japanese dictionary, we first have to decide whether the name is of Japanese, Chinese, Korean or some European origin. To determine the origin of names, we created a language identifier for names using a trigram language identification method (Cavnar and Trenkle, 1994). During training, for Chinese names, we used a list of 11,416 Chinese names together with their frequency information. For Japanese names, we used the list of 83,295 Japanese names found in ENAMDICT. For English names, we used the list of 88,000 names found at the US Census site. (Footnote 3: We have applied the same technique to Chinese and Korean names, though the details are not presented here.)</Paragraph> <Paragraph position="1"> (We did not obtain any training data for Korean names, so origin identification for Korean names is not available.) Each list of names was converted into trigrams; the trigrams in each list were then counted, and each count was normalized by dividing it by the total number of trigrams in that list. To identify a name as Chinese, Japanese or English (in effect, Other), we divide the name into trigrams and sum the normalized trigram counts for each language. A name is assigned to the language that yields the maximum sum of normalized trigram counts over the name. 
Table 1 presents the results of this simple trigram-based language identifier over the list of names used for training the trigrams.</Paragraph> <Paragraph position="2"> The following are examples of identification errors: Japanese names recognized as English, e.g., aa, abason, abire, aebakouson; Japanese names recognized as Chinese, e.g., abeseimei, abei, adan, aden, afun, agei, agoin. These errors show that the language identifier can be improved, possibly by taking into account language-specific features, such as the number of syllables in a name. For origin detection of Japanese names, the current method, with an accuracy of 92%, works well enough for a first pass.</Paragraph> <Paragraph position="3"> names are found both in the Japanese name list and in the Chinese name list; 1529 names appear in both the Japanese name list and the US Census name list; and 379 names are found both in the Chinese name list and the US Census list.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 English-Japanese Back-Transliteration </SectionTitle> <Paragraph position="0"> Once the origin of a name in Latin script is identified, we apply language-specific rules for back-transliteration. For non-Asian names, we use the katakana transliteration method described in Qu et al. (2003). For Japanese and Chinese names, we use the method described below. For example, &quot;koizumi&quot; is identified as a name of Japanese origin and is thus back-transliterated to Japanese using Japanese-specific phonetic mappings between romanji and kanji characters.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Romanji-Kanji Mapping </SectionTitle> <Paragraph position="0"> To obtain the mappings between kanji characters and their romanji representations, we used the Unihan database, prepared by the Unicode Consortium. 
The Unihan database, which currently contains 54,728 kanji characters found in Chinese, Japanese, and Korean, provides rich information about these kanji characters, such as the definition of the character, its values in different encoding systems, and the pronunciation(s) of the character in Chinese (listed under the feature kMandarin in the Unihan database), in Japanese (both the On reading and the Kun reading: kJapaneseOn and kJapaneseKun), and in Korean (kKorean). For example, for the kanji character 金, coded with Unicode hexadecimal character 91D1, the Unihan database lists 49 features; we list below its pronunciations in Japanese, Chinese, and Korean:</Paragraph> <Paragraph position="2"> In the example above, 金 is represented by its Unicode scalar value in the first column, with a feature name in the second column and the values of the feature in the third column. The Japanese Kun reading of 金 is KANE, while the Japanese On readings of 金 are KIN and KON.</Paragraph> <Paragraph position="3"> From the Unihan database, we construct mappings between Japanese readings of a character in romanji and the kanji characters in their Unicode representation. (Footnote 9, on the On and Kun readings: [...] into the Japanese writing system, two methods of transcription were used. One is called &quot;on-yomi&quot; (i.e., On reading), where the Chinese sounds of the characters were adopted for Japanese words. The other method is called &quot;kun-yomi&quot; (i.e., Kun reading), where a kanji character preserved its meaning in Chinese, but was pronounced using the Japanese sounds.)</Paragraph> <Paragraph position="4"> As kanji characters in Japanese names can have either the Kun reading or the On reading, we consider both readings as candidates for each kanji character. The mapping table has a total of 5,525 entries. 
A typical mapping is as follows:</Paragraph> <Paragraph position="6"> in which the first field specifies a pronunciation in romanji, while the remaining fields specify the possible kanji characters into which the pronunciation can be mapped.</Paragraph> <Paragraph position="7"> There is wide variation in the distribution of these mappings. For example, kou can be the pronunciation of 670 kanji characters, while the sound katakumi can be mapped to only one kanji character.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Romanji Name Back-Transliteration </SectionTitle> <Paragraph position="0"> In theory, once we have the mappings between romanji characters and the kanji characters, we can first segment a Japanese name written in romanji and then apply the mappings to back-transliterate the romanji characters into all possible kanji representations. However, for some segmentations, the number of possible kanji combinations can be so large as to make the problem computationally intractable. For example, consider the short Japanese name &quot;koizumi.&quot; This name can be segmented into the romanji characters &quot;ko-i-zu-mi&quot; using the Romanji-Kanji mapping table described in section 3.1, but this segmentation then has 182*230*73*49 (over 149 million) possible kanji combinations. Here, 182, 230, 73, and 49 are the numbers of possible kanji characters for the romanji characters &quot;ko&quot;, &quot;i&quot;, &quot;zu&quot;, and &quot;mi&quot;, respectively. In this study, we present an efficient procedure for back-transliterating romanji names to kanji characters that avoids this complexity. The procedure consists of the following steps: (1) romanji name segmentation, (2) kanji name generation, (3) kanji name filtering via a monolingual Japanese corpus, and (4) kanji-romanji combination filtering via the Web. 
Our procedure relies on filtering with corpus statistics to reduce the hypothesis space in the last three steps. We illustrate the steps below using the romanji name &quot;koizumi&quot; (小泉).</Paragraph> <Paragraph position="1"> With the romanji characters from the Romanji-Kanji mapping table, we first segment a name recognized as Japanese into sequences of romanji characters. Note that a greedy segmentation method, such as the left-to-right longest match method, often results in segmentation errors. For example, for &quot;koizumi&quot;, the longest match segmentation method produces the segmentation &quot;koi-zu-mi&quot;, while the correct segmentation is &quot;ko-izumi&quot;. Motivated by this observation, we generate all the possible segmentations for a given name. The possible segmentations for &quot;koizumi&quot; are:</Paragraph> <Paragraph position="3"> Using the same Romanji-Kanji mapping table, we obtain the possible kanji combinations for a segmentation of a romanji name produced by the previous step. For the segmentation &quot;ko-izumi&quot;, we have a total of 546 (182*3) combinations (we use the Unicode scalar value to represent the kanji characters and use spaces to separate them):</Paragraph> <Paragraph position="5"> ......</Paragraph> <Paragraph position="6"> We do not produce all possible combinations. As discussed earlier, such a generation method can produce so many combinations as to make computation infeasible for longer segmentations. To control this explosion, we eliminate unattested combinations using a bigram model of the possible kanji sequences in Japanese. From the Japanese evaluation corpus of the NTCIR-4 CLIR track, we collected bigram statistics by first applying a statistical part-of-speech tagger for Japanese (Qu et al., 2004). All valid Japanese terms and their frequencies were extracted from the tagger output. 
From this term list, we generated kanji bigram statistics (as well as an attested term list used below in step 3). With this bigram-based model, our hypothesis space is significantly reduced. For example, with the segmentation &quot;ko-i-zu-mi&quot;, even though &quot;ko-i&quot; can have 182*230 (41,860) possible combinations, we retain only the 42 kanji combinations that are attested in the corpus.</Paragraph> <Paragraph position="7"> Continuing with the romanji segments &quot;i-zu&quot;, we generate the possible kanji combinations for &quot;i-zu&quot; that can continue one of the 42 candidates for &quot;ko-i&quot;. This results in only 6 candidates for the segments &quot;ko-i-zu&quot;.</Paragraph> <Paragraph position="8"> Lastly, we consider the romanji segments &quot;zu-mi&quot;, and retain only 4 candidates for the segmentation &quot;ko-i-zu-mi&quot; whose bigram sequences are attested in our language model:</Paragraph> <Paragraph position="10"> Thus, for the segmentation &quot;ko-i-zu-mi&quot;, the bigram-based language model effectively reduces the hypothesis space from 182*230*73*49 possible kanji combinations to 4 candidates. For the alternative segmentation &quot;koi-zu-mi&quot;, no candidates can be generated by the language model.</Paragraph> <Paragraph position="11"> In step (3), we use a monolingual Japanese corpus to validate whether the kanji name candidates generated by step (2) are attested in the corpus. Here, we simply use the Japanese term list extracted from the segmented NTCIR-4 corpus created for the previous step to filter out unattested kanji combinations. 
For the segmentation &quot;ko-izumi&quot;, the following kanji combinations are attested in the corpus (preceded by their frequency in the corpus):</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4167 a0a2a1 koizumi 16 a3a4a1 koizumi 4 a5a4a1 koizumi </SectionTitle> <Paragraph position="0"> None of the four kanji candidates from the alternate segmentation &quot;ko-i-zu-mi&quot; is attested in the corpus. While step 2 filters out candidates using bigram sequences, step 3 uses corpus terms in their entirety to validate candidates.</Paragraph> <Paragraph position="1"> In step (4), we take the corpus-validated kanji candidates (for which we are not yet sure whether they correspond to the same reading as the original Japanese name written in romanji) and use the Web to validate the pairings of kanji-romanji combinations (e.g., 小泉 AND koizumi). This is motivated by two observations. First, in contrast to a monolingual corpus, Web pages are often mixed-lingual. It is often possible to find a word and its translation on the same Web page. Second, person names and specialized terminology are among the most frequent mixed-lingual items. Thus, we would expect that the appearance of both representations in close proximity on the same pages gives us more confidence in the kanji representations. For example, with the Google search engine, all three kanji-romanji combinations for &quot;koizumi&quot; are attested:</Paragraph> </Section> </Paper>
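Steps (1) to (3) of Section 3.2 (exhaustive segmentation, bigram-pruned kanji generation, and term-list filtering) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping table, the attested bigrams, the term frequencies, and all function names are hypothetical toy values chosen so that the example runs. The real table is derived from Unihan and the real statistics from the NTCIR-4 corpus.

```python
# Toy romanji -> kanji table (hypothetical; the table in the paper has 5,525 entries).
MAPPING = {
    "ko": ["小", "古", "子"],
    "koi": ["恋", "鯉"],
    "i": ["井", "伊"],
    "izumi": ["泉"],
    "zu": ["図", "頭"],
    "mi": ["美", "実"],
}

# Toy corpus statistics (hypothetical values, for illustration only).
ATTESTED_BIGRAMS = {("小", "泉"), ("古", "泉"), ("小", "井"), ("井", "図")}
ATTESTED_TERMS = {"小泉": 10}  # term -> made-up corpus frequency


def segmentations(name, table):
    """Step 1: return every way to split `name` into keys of `table`."""
    if not name:
        return [[]]
    results = []
    for end in range(1, len(name) + 1):
        head = name[:end]
        if head in table:
            for rest in segmentations(name[end:], table):
                results.append([head] + rest)
    return results


def generate(segments, table, bigrams):
    """Step 2: grow candidates one segment at a time, keeping a candidate
    only if the kanji bigram formed by its last two characters is attested."""
    if not segments:
        return []
    candidates = [[kanji] for kanji in table[segments[0]]]
    for seg in segments[1:]:
        candidates = [
            cand + [kanji]
            for cand in candidates
            for kanji in table[seg]
            if (cand[-1], kanji) in bigrams
        ]
    return ["".join(cand) for cand in candidates]


def back_transliterate(name):
    """Steps 1-3: segment, generate with bigram pruning, then keep only
    candidates that appear as whole terms in the attested term list."""
    found = {}
    for segs in segmentations(name, MAPPING):
        for kanji in generate(segs, MAPPING, ATTESTED_BIGRAMS):
            if kanji in ATTESTED_TERMS:
                found[kanji] = ATTESTED_TERMS[kanji]
    return found


if __name__ == "__main__":
    print(segmentations("koizumi", MAPPING))
    print(back_transliterate("koizumi"))
```

With this toy data, the name "koizumi" has three segmentations (ko-i-zu-mi, ko-izumi, koi-zu-mi), and only a candidate from ko-izumi survives both filters, which mirrors the outcome described in Section 3.2. Pruning after each added segment keeps the candidate list small instead of enumerating the full product of per-segment choices. The Web validation of step (4) is not modeled.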