<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-4003">
<Title>Machine Transliteration</Title>
<Section position="7" start_page="609" end_page="609" type="evalu">
<SectionTitle> 5. Experiments </SectionTitle>
<Paragraph position="0"> We have performed two large-scale experiments, one using a full-language P(w) model, and one using a personal-name language model.</Paragraph>
<Paragraph position="1"> In the first experiment, we extracted 1,449 unique katakana phrases from a corpus of 100 short news articles. Of these, 222 were missing from an on-line 100,000-entry bilingual dictionary. We back-transliterated these 222 phrases. Many of the translations are perfect: technical program, sex scandal, omaha beach, new york times, ramon diaz. Others are close: tanya harding, nickel simpson, danger washington, world cap. Some miss the mark: nancy care again, plus occur, patriot miss real. While it is difficult to judge overall accuracy--some of the phrases are onomatopoetic, and others are simply too hard even for good human translators--it is easier to identify system weaknesses, and most of these lie in the P(w) model. For example, nancy kerrigan should be preferred over nancy care again.</Paragraph>
<Paragraph position="2"> In a second experiment, we took (non-OCR) katakana versions of the names of 100 U.S. politicians, e.g., ジョン・ブロー (jyon.buroo), アルフォンス・ダマット (aruhonsu.damatto), and マイク・デワイン (maiku.dewain). We back-transliterated these by machine and asked four human subjects to do the same. The subjects were native English speakers and news-aware; we gave them brief instructions. The results are shown in Table 1.</Paragraph>
<Paragraph position="3"> There is room for improvement on both sides. Being English speakers, the human subjects were good at English name spelling and U.S. politics, but not at Japanese phonetics. A native Japanese speaker might be expert at the latter but not the former. People who are expert in all of these areas, however, are rare.</Paragraph>
<Paragraph position="4"> Table 1
                                                                                Human   Machine
correct (e.g., spencer abraham / spencer abraham)                                27%      64%
phonetically equivalent, but misspelled (e.g., richard brian / richard bryan)     7%      12%
incorrect (e.g., olin hatch / orren hatch)                                       66%      24%
On the automatic side, many errors can be corrected. A first-name/last-name model would rank richard bryan more highly than richard brian. A bigram model would prefer orren hatch over olin hatch. Other errors are due to unigram training problems or, more rarely, to incorrect or brittle phonetic models. For example, Long occurs much more often than Ron in newspaper text, and our word selection does not exclude phrases like Long Island, so we get long wyden instead of ron wyden. One way to fix these problems is to manually change unigram probabilities: reducing P(long) by a factor of ten solves the problem while maintaining a high score for P(long | rongu).</Paragraph>
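To make the rescoring fix above concrete, here is a minimal, hypothetical sketch (not the paper's weighted finite-state implementation) of how candidate back-transliterations can be ranked by combining a unigram P(w) score with a phonetic-channel score, and how manually damping P(long) by a factor of ten flips the long wyden / ron wyden decision. All counts, probabilities, romanizations, and function names below are illustrative assumptions, not figures from the paper.

import math

# Hypothetical unigram counts from newspaper text (illustrative only; not the
# paper's data). "long" is assumed far more frequent than "ron".
UNIGRAM_COUNTS = {"long": 3000, "ron": 800, "wyden": 20, "island": 8000}
TOTAL = sum(UNIGRAM_COUNTS.values())

def log_p_w(words, damping=None):
    # Log unigram probability of a word sequence. `damping` manually rescales
    # selected unigram probabilities, e.g. {"long": 0.1} divides P(long) by ten.
    damping = damping or {}
    score = 0.0
    for w in words:
        p = UNIGRAM_COUNTS.get(w, 1) / TOTAL
        score += math.log(p * damping.get(w, 1.0))
    return score

# Hypothetical phonetic-channel scores log P(katakana | w) for the katakana
# phrase (romanized roughly as "rongu waiden"): both candidates explain the
# katakana well, so the channel barely separates them.
LOG_P_K_GIVEN_W = {
    ("long", "wyden"): math.log(0.9),
    ("ron", "wyden"): math.log(0.8),
}

def best_candidate(candidates, damping=None):
    # Rank candidates by log P(w) + log P(katakana | w) and return the best one.
    return max(candidates, key=lambda c: log_p_w(c, damping) + LOG_P_K_GIVEN_W[c])

CANDIDATES = [("long", "wyden"), ("ron", "wyden")]
print(best_candidate(CANDIDATES))                         # ('long', 'wyden'): raw unigram frequency wins
print(best_candidate(CANDIDATES, damping={"long": 0.1}))  # ('ron', 'wyden'): after dividing P(long) by ten

The point of the sketch is that the damping only touches P(w); the channel score P(long | rongu) is left intact, which is why the adjustment described above does not hurt cases where long really is the intended word.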
<Paragraph position="5"> Despite these problems, the machine's performance is impressive. When word separators (・) are removed from the katakana phrases, rendering the task exceedingly difficult for people, the machine's performance is unchanged. In other words, it offers the same top-scoring translations whether or not the separators are present; however, their presence significantly cuts down on the number of alternatives considered, improving efficiency. When we use OCR, 7% of katakana tokens are misrecognized, affecting 50% of test strings, but translation accuracy only drops from 64% to 52%.</Paragraph>
</Section>
</Paper>