File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/j98-4003_abstr.xml

Size: 7,081 bytes

Last Modified: 2025-10-06 13:49:15

<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-4003">
  <Title>Machine Transliteration</Title>
  <Section position="2" start_page="0" end_page="600" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> One of the most frequent problems translators must deal with is translating proper names and technical terms. For language pairs like Spanish/English, this presents no great challenge: a phrase like Antonio Gil usually gets translated as Antonio Gil. However, the situation is more complicated for language pairs that employ very different alphabets and sound systems, such as Japanese/English and Arabic/English. Phonetic translation across these pairs is called transliteration. We will look at Japanese/English transliteration in this article.</Paragraph>
    <Paragraph position="1"> Japanese frequently imports vocabulary from other languages, primarily (but not exclusively) from English. It has a special phonetic alphabet called katakana, which is used primarily (but not exclusively) to write down foreign names and loanwords. The katakana symbols are shown in Figure 1, with their Japanese pronunciations. The two symbols shown in the lower right corner (ー, ッ) are used to lengthen any Japanese vowel or consonant.</Paragraph>
    <Paragraph position="2"> To write a word like golfbag in katakana, some compromises must be made. For example, Japanese has no distinct L and R sounds: the two English sounds collapse onto the same Japanese sound. A similar compromise must be struck for English H and F. Also, Japanese generally uses an alternating consonant-vowel structure, making it impossible to pronounce LFB without intervening vowels. Katakana writing is a syllabary rather than an alphabet: there is one symbol for ga (ガ), another for gi (ギ), another for gu (グ), etc. So the way to write golfbag in katakana is ゴルフバッグ, roughly pronounced go-ru-hu-ba-ggu. Here are a few more examples: ramp (ranpu); lamp (ranpu); casual fashion (kajyuaruhasshyon); team leader (chiimuriidaa). Notice how the transliteration is more phonetic than orthographic; the letter h in Johnson does not produce any katakana. Also, a dot-separator (・) is used to separate words, but not consistently. And transliteration is clearly an information-losing operation: ranpu could come from either lamp or ramp, while aisukuriimu loses the distinction between ice cream and I scream.
[Figure 1: Katakana symbols, with their Japanese pronunciations.]
* USC/Information Sciences Institute, Marina del Rey, CA 90292 and USC/Computer Science Department, Los Angeles, CA 90089
† USC/Computer Science Department, Los Angeles, CA 90089
© 1998 Association for Computational Linguistics. Computational Linguistics, Volume 24, Number 4</Paragraph>
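    <Paragraph> The information loss noted above can be illustrated with a minimal Python sketch. It is not the method developed in this article: it only inverts a tiny table of forward transliterations, built from the lamp/ramp and ice cream/I scream pairs mentioned in the text, to show that one romanized katakana string can have several English sources. The names FORWARD, invert, and BACK are hypothetical.

```python
from collections import defaultdict

# Forward transliterations mentioned in the text:
# (English phrase, romanized katakana). Illustrative data only.
FORWARD = [
    ("lamp", "ranpu"),
    ("ramp", "ranpu"),
    ("ice cream", "aisukuriimu"),
    ("I scream", "aisukuriimu"),
]


def invert(pairs):
    """Group English sources by the romanized katakana form they share."""
    candidates = defaultdict(list)
    for english, romaji in pairs:
        candidates[romaji].append(english)
    return dict(candidates)


BACK = invert(FORWARD)
print(BACK["ranpu"])  # ['lamp', 'ramp']
```

Because the forward mapping is many-to-one, the inverse returns a set of candidates rather than a single answer, and a back-transliterator must choose among them.</Paragraph>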
    <Paragraph position="3"> Transliteration is not trivial to automate, but we will be concerned with an even more challenging problem: going from katakana back to English, i.e., back-transliteration. Human translators can often "sound out" a katakana phrase to guess an appropriate translation. Automating this process has great practical importance in Japanese/English machine translation. Katakana phrases are the largest source of text phrases that do not appear in bilingual dictionaries or training corpora (a.k.a. "not-found words"), but very little computational work has been done in this area. Yamron et al. (1994) briefly mention a pattern-matching approach, while Arbabi et al. (1994) discuss a hybrid neural-net/expert-system approach to (forward) transliteration. The information-losing aspect of transliteration makes it hard to invert. Here are some problem instances, taken from actual newspaper articles: (romanized) aasudee, robaato shyoon renaado, and masutaazutoonamento. English translations appear later in this article.
Knight and Graehl, Machine Transliteration</Paragraph>
    <Paragraph position="4"> Here are a few observations about back-transliteration that give an idea of the difficulty of the task: * Back-transliteration is less forgiving than transliteration. There are many ways to write an English word like switch in katakana, all equally valid, but we do not have this flexibility in the reverse direction. For example, we cannot drop the t in switch, nor can we write arture when we mean archer. Forward-direction flexibility wreaks havoc with dictionary-based solutions, because no dictionary will contain all katakana variants.</Paragraph>
    <Paragraph position="5"> * Back-transliteration is harder than romanization. A romanization scheme simply sets down a method for writing a foreign script in roman letters.</Paragraph>
    <Paragraph position="6"> For example, to romanize アンジラ, we look up each symbol in Figure 1 and substitute characters. This substitution gives us (romanized) anjira, but not (translated) angela. Romanization schemes are usually deterministic and invertible, although small ambiguities can arise. We discuss some wrinkles in Section 3.4.</Paragraph>
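    <Paragraph> The lookup-and-substitute step just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the article's implementation: ROMAN is a hypothetical four-symbol subset of the Figure 1 table, and romanize and deromanize are illustrative names. The inverse uses a simple longest-match scan, which works for this subset but ignores the small ambiguities a full table can raise.

```python
# Assumed subset of the Figure 1 table: katakana symbol to roman letters.
ROMAN = {"ア": "a", "ン": "n", "ジ": "ji", "ラ": "ra"}


def romanize(katakana):
    """Substitute each katakana symbol with its roman spelling."""
    return "".join(ROMAN[symbol] for symbol in katakana)


def deromanize(text):
    """Invert romanize by matching the longest known chunk first."""
    inverse = {roman: kana for kana, roman in ROMAN.items()}
    out, i = [], 0
    while i != len(text):
        for size in (2, 1):  # try the two-letter chunk before the one-letter chunk
            chunk = text[i:i + size]
            if chunk in inverse:
                out.append(inverse[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError("no symbol for " + repr(text[i:]))
    return "".join(out)


print(romanize("アンジラ"))    # anjira
print(deromanize("anjira"))  # アンジラ
```

The round trip recovers the katakana exactly, yet nothing in the table leads from anjira to the English spelling angela, which is why back-transliteration is harder than romanization.</Paragraph>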
    <Paragraph position="7"> * Finally, not all katakana phrases can be "sounded out" by back-transliteration. Some phrases are shorthand, e.g., ワープロ (waapuro) should be translated as word processing. Others are onomatopoetic and difficult to translate. These cases must be solved by techniques other than those described here.</Paragraph>
    <Paragraph position="8"> The most desirable feature of an automatic back-transliterator is accuracy. If possible, our techniques should also be: * portable to new language pairs like Arabic/English with minimal effort, possibly reusing resources.</Paragraph>
    <Paragraph position="9"> * robust against errors introduced by optical character recognition.</Paragraph>
    <Paragraph position="10"> * relevant to speech recognition situations in which the speaker has a heavy foreign accent.</Paragraph>
    <Paragraph position="11"> * able to take textual (topical/syntactic) context into account, or at least be able to return a ranked list of possible English translations.</Paragraph>
    <Paragraph position="12"> Like most problems in computational linguistics, this one requires full world knowledge for a 100% solution. Choosing between Katarina and Catalina (both good guesses for カタリナ) might even require detailed knowledge of geography and figure skating. At that level, human translators find the problem quite difficult as well, so we only aim to match or possibly exceed their performance.</Paragraph>
  </Section>
</Paper>