<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1017">
  <Title>Machine Transliteration</Title>
  <Section position="7" start_page="133" end_page="133" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We have performed two large-scale experiments, one using a full-language P(w) model, and one using a personal name language model.</Paragraph>
    <Paragraph position="1"> In the first experiment, we extracted 1449 unique katakana phrases from a corpus of 100 short news articles. Of these, 222 were missing from an on-line 100,000-entry bilingual dictionary. We back-transliterated these 222 phrases. Many of the translations are perfect: technical program, sez scandal, omaha beach, new york times, ramon diaz. Others are close: tanya harding, nickel simpson, danger washington, world cap. Some miss the mark: nancy care again, plus occur, patriot miss real. While it is difficult to judge overall accuracy--some of the phases are onomatopoetic, and others are simply too hard even for good human translators--it is easier to identify system weaknesses, and most of these lie in the P(w) model. For example, nancy kerrigan should be preferred over nancy care again.</Paragraph>
    <Paragraph position="2"> In a second experiment, we took katakana versions of the names of 100 U.S. politicians, e.g.: -Jm :/. 7' =-- (jyon.buroo), T~/~ .</Paragraph>
    <Paragraph position="3"> ~'0' I&amp;quot; (a.rhonsu.dama~;'C/o), and &amp;quot;~'4 3' * ~7,f :/ (maiku.de~ain). We back-transliterated these by machine and asked four human subjects to do the same. These subjects were native English speakers and news-aware: we gave them brief instructions, examples, and hints. The results were as follows: correct (e.g., spencer abraham / spencer abraham) phonetically equivalent, but misspelled (e.g., richard brian /  There is room for improvement on both sides. Being English speakers, the human subjects were good at English name spelling and U.S. politics, but not at Japanese phonetics. A native Japanese speaker might be expert at the latter but not the former. People who are expert in all of these areas, however, are rare.</Paragraph>
    <Paragraph position="4"> On the automatic side. many errors can be corrected. A first-name/last-name model would rank richard bryan more highly than richard brian. A bi-gram model would prefer orren hatch over olin hatch. Other errors are due to unigram training problems, or more rarely, incorrect or brittle phonetic models. For example, &amp;quot;Long&amp;quot; occurs much more often than &amp;quot;R.on&amp;quot; in newspaper text, and our word selection does not exclude phrases like &amp;quot;Long Island.&amp;quot; So we get long wyden instead of ton wyden. Rare errors are due to incorrect or brittle phonetic models.</Paragraph>
    <Paragraph position="5"> Still the machine's performance is impressive.</Paragraph>
    <Paragraph position="6"> When word separators (,) are removed from the katakana phrases, rendering the task exceedingly difficult for people, the machine's performance is unchanged. When we use OCR. 7% of katakana tokens are mis-recognized, affecting 50% of test strings, but accuracy only drops from 64% to 52%.</Paragraph>
  </Section>
class="xml-element"></Paper>