<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1024">
  <Title>Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation</Title>
  <Section position="9" start_page="0" end_page="0" type="concl">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> In this study, we have examined a solution to the previously little-studied problem of transliterating CJK names written in Latin script back into their ideographic representations. The solution involves first identifying the origin of each CJK name and then back-transliterating the name to its ideographic representation using language-specific sound-to-character mappings.</Paragraph>
    <Paragraph position="1"> We have demonstrated that a simple trigram-based language identifier can serve adequately for identifying names of Japanese origin. During back-transliteration, the number of candidates can be massive because each Japanese sound maps to many kanji. To reduce the complexity, we apply a three-tier filtering process that eliminates most incorrect candidates while still achieving an F-measure of 0.38 on a test set of given names and an F-measure of 0.56 on a test set of surnames. The three filtering steps use, in order, a bigram model derived from a large segmented Japanese corpus, a list of attested terms from the same corpus, and the whole Web as a corpus. The Web is used to validate the back-transliterations using statistics of pages that contain both the candidate kanji back-transliteration and the original romaji name.</Paragraph>
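The three-tier filtering summarized above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function name three_tier_filter, the thresholds min_bigram_score and min_pages, and the callables bigram_score and page_count are all hypothetical stand-ins for the bigram model, the attested-term list, and the Web page statistics. The scores and page counts in the usage lines are invented toy values.

```python
from typing import Callable, List, Set, Tuple


def three_tier_filter(
    romaji: str,
    candidates: List[str],
    bigram_score: Callable[[str], float],
    attested_terms: Set[str],
    page_count: Callable[[str, str], int],
    min_bigram_score: float,
    min_pages: int = 1,
) -> List[Tuple[str, int]]:
    """Filter kanji candidates for one romaji name (hypothetical sketch)."""
    # Tier 1: keep candidates that the bigram model scores at or above a threshold.
    tier1 = [c for c in candidates if bigram_score(c) >= min_bigram_score]
    # Tier 2: keep candidates that appear in the list of attested corpus terms.
    tier2 = [c for c in tier1 if c in attested_terms]
    # Tier 3: keep candidates that share enough pages with the romaji name.
    counted = [(c, page_count(c, romaji)) for c in tier2]
    tier3 = [(c, n) for c, n in counted if n >= min_pages]
    # Rank the survivors by page count, highest first.
    return sorted(tier3, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy_scores = {"松田": -2.0, "松多": -9.0, "待田": -5.0, "末田": -4.0}
toy_attested = {"松田", "末田"}
toy_pages = {("松田", "matsuda"): 120, ("末田", "matsuda"): 0}

survivors = three_tier_filter(
    romaji="matsuda",
    candidates=["松田", "松多", "待田", "末田"],
    bigram_score=lambda c: toy_scores[c],
    attested_terms=toy_attested,
    page_count=lambda c, r: toy_pages.get((c, r), 0),
    min_bigram_score=-6.0,
)
print(survivors)
```

Ordering the tiers from cheapest to most expensive means the Web lookup, the costliest step, is issued only for candidates that have already passed the two corpus-based checks.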
    <Paragraph position="2"> Based on the results of this study, our future work will involve testing the effectiveness of the current method in real CLIR applications, applying the method to other types of proper names and other language pairs, and exploring new methods for improving precision and recall for romaji name back-transliteration. In cross-language applications such as English-to-Japanese retrieval, dealing with a romaji name that is missing from the bilingual lexicon should involve (1) identifying the origin of the name in order to select the appropriate language-specific mappings, and (2) automatically generating the back-transliterations of the name in the right orthographic representations (e.g., katakana representations for foreign Latin-origin names or kanji representations for native Japanese names). To further improve precision and recall, one promising technique is fuzzy matching (Meng et al., 2001) for dealing with phonological transformations in name generation that are not considered in our current approach (e.g., &quot;matsuda&quot; vs. &quot;matsuta&quot;). Lastly, we will explore whether the proposed romaji-to-kanji back-transliteration approach applies to other types of names, such as place names, and study the effectiveness of the approach for back-transliterating names of Chinese and Korean origin written in Latin script to their respective ideographic representations.</Paragraph>
  </Section>
</Paper>