<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1017">
  <Title>Machine Transliteration</Title>
  <Section position="2" start_page="0" end_page="128" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Translators must deal with many problems, and one of the most frequent is translating proper names and technical terms. For language pairs like Spanish/English, this presents no great challenge: a phrase like Antonio Gil usually gets translated as Antonio Gil. However, the situation is more complicated for language pairs that employ very different alphabets and sound systems, such as Japanese/English and Arabic/English. Phonetic translation across these pairs is called transliteration. We will look at Japanese/English transliteration in this paper.</Paragraph>
    <Paragraph position="1"> Japanese frequently imports vocabulary from other languages, primarily (but not exclusively) from English. It has a special phonetic alphabet called katakana, which is used primarily (but not exclusively) to write down foreign names and loanwords. To write a word like golf bag in katakana, some compromises must be made. For example, Japanese has no distinct L and R sounds: the two English sounds collapse onto the same Japanese sound.</Paragraph>
    <Paragraph position="2"> A similar compromise must be struck for English H and F. Also, Japanese generally uses an alternating consonant-vowel structure, making it impossible to pronounce LFB without intervening vowels. Katakana writing is a syllabary rather than an alphabet--there is one symbol for ga (ガ), another for gi (ギ), another for gu (グ), etc. So the way to write golfbag in katakana is ゴルフバッグ, roughly pronounced goruhubaggu. Here are a few more examples:</Paragraph>
    <Paragraph position="4"> Notice how the transliteration is more phonetic than orthographic; the letter h in Johnson does not produce any katakana. Also, a dot-separator (.) is used to separate words, but not consistently. And transliteration is clearly an information-losing operation: aisukuriimu loses the distinction between ice cream and I scream.</Paragraph>
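The compromises described above (English L and R collapsing onto one sound, F merging with H, and vowels being inserted to keep an alternating consonant-vowel pattern) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function name toy_katakana_sounds, the letter-per-sound input, and the three rewrite rules are hypothetical simplifications, not the model developed in this paper. In particular, the sketch ignores consonant doubling, so it yields goruhubagu rather than goruhubaggu.

```python
# Toy illustration only: these rules are simplified assumptions, not the paper's model.
VOWELS = set("aeiou")

# English sounds with no distinct Japanese counterpart collapse onto a shared sound.
COLLAPSE = {"l": "r", "f": "h"}


def toy_katakana_sounds(english_sounds):
    """Map rough English sound symbols to a romanized, Japanese-like string."""
    merged = [COLLAPSE.get(sound, sound) for sound in english_sounds]
    out = []
    for i, sound in enumerate(merged):
        out.append(sound)
        # Vowels need no support; "n" is allowed to close a syllable.
        if sound in VOWELS or sound == "n":
            continue
        next_is_vowel = i + 1 != len(merged) and merged[i + 1] in VOWELS
        if not next_is_vowel:
            # Insert a vowel after an unsupported consonant: "o" after t/d, else "u".
            out.append("o" if sound in ("t", "d") else "u")
    return "".join(out)


print(toy_katakana_sounds(list("golfbag")))  # goruhubagu
print(toy_katakana_sounds(list("lap")))      # rapu
print(toy_katakana_sounds(list("rap")))      # rapu
```

Because lap and rap map to the same output, the original word cannot be recovered from that output alone, which is the information loss noted above.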
    <Paragraph position="5"> Transliteration is not trivial to automate, but we will be concerned with an even more challenging problem--going from katakana back to English, i.e., back-transliteration. Automating back-transliteration has great practical importance in Japanese/English machine translation. Katakana phrases are the largest source of text phrases that do not appear in bilingual dictionaries or training corpora (a.k.a. "not-found words"). However, very little computational work has been done in this area: Yamron et al. (1994) briefly mention a pattern-matching approach, while Arbabi et al. (1994) discuss a hybrid neural-net/expert-system approach to (forward) transliteration.</Paragraph>
    <Paragraph position="6"> The information-losing aspect of transliteration makes it hard to invert. Here are some problem instances, taken from actual newspaper articles (texts used in ARPA Machine Translation evaluations, November 1994):</Paragraph>
    <Paragraph position="8"> English translations appear later in this paper.</Paragraph>
    <Paragraph position="9"> Here are a few observations about back-transliteration: * Back-transliteration is less forgiving than transliteration. There are many ways to write an English word like switch in katakana, all equally valid, but we do not have this flexibility in the reverse direction. For example, we cannot drop the t in switch, nor can we write arture when we mean archer.</Paragraph>
    <Paragraph position="10"> * Back-transliteration is harder than romanization, which is a (frequently invertible) transformation of a non-roman alphabet into roman letters. There are several romanization schemes for katakana writing--we have already been using one in our examples. Katakana writing follows Japanese sound patterns closely, so katakana often doubles as a Japanese pronunciation guide. However, as we shall see, there are many spelling variations that complicate the mapping between Japanese sounds and katakana writing.</Paragraph>
    <Paragraph position="11"> * Finally, not all katakana phrases can be "sounded out" by back-transliteration. Some phrases are shorthand, e.g., ワープロ (uaapuro) should be translated as word processing. Others are onomatopoetic and difficult to translate. These cases must be solved by techniques other than those described here.</Paragraph>
    <Paragraph position="12"> The most desirable feature of an automatic back-transliterator is accuracy. If possible, our techniques should also be: * portable to new language pairs like Arabic/English with minimal effort, possibly reusing resources.</Paragraph>
    <Paragraph position="13"> * robust against errors introduced by optical character recognition.</Paragraph>
    <Paragraph position="14"> * relevant to speech recognition situations in which the speaker has a heavy foreign accent.</Paragraph>
    <Paragraph position="15"> * able to take textual (topical/syntactic) context into account, or at least be able to return a ranked list of possible English translations.</Paragraph>
    <Paragraph position="16"> Like most problems in computational linguistics, this one requires full world knowledge for a 100% solution. Choosing between Katarina and Catalina (both good guesses for カタリナ) might even require detailed knowledge of geography and figure skating. At that level, human translators find the problem quite difficult as well, so we only aim to match or possibly exceed their performance.</Paragraph>
  </Section>
</Paper>