<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1102"> <Title>Detecting Transliterated Orthographic Variants via Two Similarity Metrics</Title>
<Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> This paper discusses a detection method for transliterated orthographic variants of foreign words. Transliteration of foreign words causes orthographic variants because there are several conditions under which transliteration is done. One person may transliterate so as to approximate the pronunciation, whereas another may transliterate based on the spelling. For example, the English word report can be transliterated into two Japanese words, リポート (ripooto) and レポート (repooto). The former, ripooto, is based on an approximation of its pronunciation, while repooto is transliterated from its spelling.</Paragraph>
<Paragraph position="1"> In addition, transliterations of a word may come from several source languages. For instance, the English word virus corresponds to the Japanese words ウイルス (uirusu) from Latin and ビールス (biirusu) from German; forms that approximate the English pronunciation are also possible as transliterations. Moreover, some foreign words end up in different forms in Japanese because of variation in English pronunciation, e.g., between British and American English. For example, the English word body corresponds to two words: ボディ (bodi) from the British pronunciation and バディ (badi) from the American one.</Paragraph>
<Paragraph position="4"> One may think that if back-transliteration were done precisely, those variants would be back-transliterated into one word and thus recognized as variants. However, back-transliteration is known to be a very difficult task (Knight and Graehl, 1997).</Paragraph>
<Paragraph position="5"> Not only Japanese but any language that has a phonetic spelling has this problem of transliterated orthographic variants. For example, English has variants for a Chinese proper noun such as Shanhaiguan, Shanhaikwan, or Shanhaikuan.</Paragraph>
<Paragraph position="6"> Nowadays, it is well recognized that orthographic variant correction is an important processing step for achieving high performance in natural language processing. In order to achieve robust and reliable processing, we have to use many language resources: many types of corpora, dictionaries, thesauri, and so on. Orthographic variants cause many mismatches at any stage of natural language processing. In addition, not only orthographic variants but also misspelled words tend to slip into a corpus. These words boost the perplexity of the corpus and worsen the data sparseness problem.</Paragraph>
<Paragraph position="7"> To date, several studies have tried to cope with this orthographic variant problem; however, they considered the problem in a relatively clean corpus that was well organized by natives of the target language. As with orthographic variants, misspelled words cause mismatches, and we have to detect not only predictable orthographic variants but also misspelled variants.
In addition, it is very hard to detect orthographic variants caused by misspelling with ordinary rule-based methods, because preparing rules for every misspelling that might be written is unrealistic.</Paragraph>
<Paragraph position="8"> If a corpus includes texts written by non-natives of the language, misspelled orthographic variants are likely to increase, because non-natives have a limited vocabulary in that language. We propose a robust detection method for transliterated orthographic variants in a Japanese corpus. The method is characterized by a combination of different types of similarity. Detecting simple misspelled words is not such a difficult task, because a sufficiently large dictionary will tell us whether a word is common. However, it often happens that a misspelled word is itself a common word. For example, in English, someone may mistype from as form, and string information tells us nothing because both words are common. Therefore, we use contextual information to detect this kind of mistyping.</Paragraph>
<Paragraph position="9"> 2 Transliteration for foreign words in Japanese: katakana
Japanese features three types of characters: katakana, hiragana, and kanji (Chinese characters). Katakana is a syllabary used mostly to write Western loanwords, onomatopoeic words, names of plants and animals, and non-Japanese personal and place names, as well as for emphasis and slang, while hiragana is the ordinary syllabary.</Paragraph>
<Paragraph position="10"> Katakana cannot express the precise pronunciation of loanwords, because the katakana transliteration of a loanword is an attempt to approximate the pronunciation of its etymon (the foreign word from which it is derived). Thus katakana orthography is often irregular, and the same word may be written in multiple ways. Although there are general guidelines for loanword orthography, in practice there is considerable variation. In addition, recent years have seen an enormous increase in katakana use, not only in technical terminology but also in common daily usage.</Paragraph>
<Paragraph position="11"> To date, several detection methods have been proposed for English and other languages. One may think that such methods can be applied to Japanese and work well. However, most katakana characters correspond to two phonemes, and this causes several problems: because of this correspondence between katakana characters and phonemes, applying such methods directly would require tangled procedures.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Romanization </SectionTitle>
<Paragraph position="0"> We use Japanese romanization for katakana characters to capture their pronunciation, because there are several katakana characters whose pronunciation is the same. For example, ジ (possible romanization: zi/ji) and ヂ (possible romanization: di/zi) are not differentiated in pronunciation. In addition, there are several katakana expressions that have very similar pronunciations, and they also cause variants. Naturally, we could use katakana characters directly to compare two strings, but doing so makes the comparison procedure cumbersome and complicated. To avoid this complicated comparison procedure for katakana expressions, we use Japanese romanization.</Paragraph>
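As a concrete illustration of this comparison step, the following Python fragment is a minimal sketch, not the implementation used in this work. It assumes a tiny, hand-written katakana-to-romaji table covering only the examples mentioned above, romanizes two katakana variants into a Kunreisiki-style form in which characters sharing a pronunciation (such as ジ and ヂ) receive the same romanization, and then compares the results with a plain edit-distance ratio.

# Minimal sketch (not the system described in this paper): romanize two
# katakana strings into a Kunreisiki-style form and compare them with a
# simple edit-distance ratio. The mapping table below is hypothetical and
# covers only the examples in the text; a real system needs the full
# katakana inventory.

KATAKANA_TO_ROMAJI = {
    "リ": "ri", "レ": "re", "ポ": "po", "ト": "to", "ー": ":",  # ":" marks a long vowel
    "ウ": "u", "イ": "i", "ル": "ru", "ス": "su", "ビ": "bi",
    "ボ": "bo", "バ": "ba", "デ": "de", "ディ": "di",
    "ジ": "zi", "ヂ": "zi",  # identical pronunciation -> identical romanization
    "シ": "si",              # Kunreisiki-style "si" rather than Hepburn "shi"
}

def romanize(katakana):
    """Greedy left-to-right lookup, trying two-character units (e.g. ディ) first."""
    out, i = [], 0
    while i < len(katakana):
        for length in (2, 1):
            chunk = katakana[i:i + length]
            if chunk in KATAKANA_TO_ROMAJI:
                out.append(KATAKANA_TO_ROMAJI[chunk])
                i += len(chunk)
                break
        else:
            out.append(katakana[i])  # pass unknown characters through unchanged
            i += 1
    return "".join(out)

def edit_distance(a, b):
    """Ordinary Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(k1, k2):
    r1, r2 = romanize(k1), romanize(k2)
    return 1.0 - edit_distance(r1, r2) / max(len(r1), len(r2), 1)

# リポート (ripooto) vs. レポート (repooto): only one vowel differs after
# romanization, so the two transliterations of "report" come out highly similar.
print(similarity("リポート", "レポート"))

String similarity of this kind cannot, of course, catch mistypings such as from/form, which is why the proposed method combines it with contextual information.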
<Paragraph position="2"> We use a system of Japanese romanization based on ISO 3602, which follows the Kunreisiki system. There are two major systems for Japanese romanization. One is based on the Kunrei (Kunreisiki) system; the other is the Hepburn system, which is widely used in English-speaking communities.</Paragraph>
<Paragraph position="3"> The Kunreisiki system was designed to represent kana morphology accurately. For example, the katakana character シ is written as si in the Kunreisiki system, while it is written as shi in the Hepburn system. In this example, the inserted character h disturbs simple matching procedures, because most katakana characters correspond to two romanized characters: a consonant and a vowel. Thus we prefer a romanization system based on the Kunreisiki system to keep the matching procedure simple.</Paragraph>
</Section> </Section> </Paper>