<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1119">
  <Title>Back Transliteration from Japanese to English Using Target English Context</Title>
  <Section position="3" start_page="1" end_page="12" type="metho">
    <SectionTitle>
2 Proposed Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 Advantage of using English context
</SectionTitle>
      <Paragraph position="0"> First we explain the difficulty of back transliteration without a pronunciation dictionary. Next, we clarify the reason for the difficulty. Finally, we clarify the effect using English context in back transliteration.</Paragraph>
      <Paragraph position="1"> In back transliteration, an English letter or string is chosen to correspond to a katakana character or string. However, this decision is difficult. For example, there are cases that an English letter &amp;quot;u&amp;quot; corresponds to &amp;quot;a&amp;quot; of katakana, and there are cases that the same English letter &amp;quot;u&amp;quot; does not correspond to the same &amp;quot;a&amp;quot; of katakana. &amp;quot;u&amp;quot; in Cunningham corresponds to &amp;quot;a&amp;quot; in katakana and &amp;quot;u&amp;quot; in Bush does not correspond to &amp;quot;a&amp;quot; in katakana. It is difficult to resolve this ambiguity without the pronunciation registered in a dictionary. null The difference in correspondence mainly comes from the difference of the letters around the English letter &amp;quot;u.&amp;quot; The correspondence of an English letter or string to a katakana character or string varies depending on the surrounding characters, i.e., on its English context.</Paragraph>
      <Paragraph position="2"> Thus, our back transliteration method uses the target English context to calculate the probability of English letters corresponding to a katakana character or string.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="12" type="sub_section">
      <SectionTitle>
2.2 Notation and conversion-candidate lattice
</SectionTitle>
      <Paragraph position="0"> We formulate the word conversion process as a unit conversion process for treating new words.</Paragraph>
      <Paragraph position="1"> Here, the unit is one or more characters that form a part of characters of the word.</Paragraph>
      <Paragraph position="2"> A katakana word, K, is expressed by equation 2.1 with &amp;quot;^&amp;quot; and &amp;quot;$&amp;quot; added to its start and end, respectively.</Paragraph>
      <Paragraph position="4"> k is the j-th character in the katakana word, and m is the number of characters except for &amp;quot;^&amp;quot; and &amp;quot;$&amp;quot; and</Paragraph>
      <Paragraph position="6"> We use katakana units constructed of one or more katakana characters. We denote a katakana unit as ku. For any ku, many English units, eu, could be corresponded as conversion-candidates.</Paragraph>
      <Paragraph position="7"> The ku's and eu's are generated using a learning corpus in which bilingual words are separated into units and every ku unit is related an eu unit.</Paragraph>
      <Paragraph position="8"> {}EL denotes the lattice of all eu's corresponding to ku's covering a Japanese word. Every eu is a node of the lattice and each node is connected with next nodes. {}EL has a lattice structure starting from &amp;quot;^&amp;quot; and ending at &amp;quot;$.&amp;quot; Figure 1 shows an example of {}EL corresponding to a katakana word &amp;quot;kirusiyusiyutain (ki ru shu shu ta i n).&amp;quot; In the figure, each circle represents one eu.</Paragraph>
      <Paragraph position="9"> A character string linking individual character units in the paths</Paragraph>
      <Paragraph position="11"> and &amp;quot;$&amp;quot; in {}EL becomes a conversion candidate, where q is the number of paths between &amp;quot;^&amp;quot; and &amp;quot;$&amp;quot; in {}EL .</Paragraph>
      <Paragraph position="12"> We get English word candidates by joining eu's from &amp;quot;^&amp;quot; to &amp;quot;$&amp;quot; in {}EL . We select a certain path,</Paragraph>
      <Paragraph position="14"> except for &amp;quot;^&amp;quot; and &amp;quot;$&amp;quot; in p d is expressed as () d np .</Paragraph>
      <Paragraph position="15"> The character units in p d are numbered from start to end.</Paragraph>
      <Paragraph position="16"> The English word, E, resulting from the conversion of a katakana word, K, for p</Paragraph>
      <Paragraph position="18"> is the j-th character in the English word.</Paragraph>
      <Paragraph position="19"> () d lp is the number of characters except for &amp;quot;^&amp;quot; and &amp;quot;$&amp;quot; in the English word.  To use the English context for calculating the matching of an English unit with a katakana unit, the above equation is transformed into Equation  Equation 2.7 contains a translation model in which an English word is a condition and katakana is a result.</Paragraph>
      <Paragraph position="20"> The word in the translation model (|)P K E in Equation 2.7 is broken down into character units by using equations 2.3 and 2.4.</Paragraph>
      <Paragraph position="21">  Here, a is a constant. Equation 2.11 is an (a+1)gram model of English letters. Next, we approximate the translation model  . For this, we use our previously proposed approximation technique (Goto et al., 2003). The outline of the technique is shown as follows.</Paragraph>
      <Paragraph position="23"> Equation 2.15 is the equation of our back transliteration method.</Paragraph>
      <Paragraph position="24"> 2.4 Beam search solution for context sensitive grammar Equation 2.15 includes context-sensitive grammar. As such, it can not be carried out efficiently. In decoding from the head of a word to the tail, e end(i)+1 in equation 2.15 becomes contextsensitive. Thus we try to get approximate results by using a beam search solution. To get the results, we use dynamic programming. Every node of eu in the lattice keeps the N-best results evaluated by using a letter of e end(i)+1 that gives the maximum probability in the next letters. When the results of next node are evaluated for selecting the N-best, the accurate probabilities from the previous nodes are used.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
2.5 Learning probability models based on the maximum entropy method
</SectionTitle>
      <Paragraph position="0"> on the maximum entropy method The probability models are learned based on the maximum entropy method. This makes it possible to prevent data sparseness relating to the model as well as to efficiently utilize many conditions, such as context, simultaneously. We use the Gaussian Prior (Chen and Rosenfeld, 1999) smoothing method for the language model. We use one Gaussian variance. We use the value of the Gaussian variance that minimizes the test set's perplexity.</Paragraph>
      <Paragraph position="1"> The feature functions of the models based on the maximum entropy method are defined as combinations of letters. In addition, we use vowel, consonant, and semi-vowel classes for the translation model. We manually define the combinations of the letter positions such as e</Paragraph>
      <Paragraph position="3"> The feature functions consist of the letter combinations that meet the combinations of the letter positions and are observed at least once in the learning data.</Paragraph>
    </Section>
    <Section position="4" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
2.6 Corpus for learning
</SectionTitle>
      <Paragraph position="0"> A Japanese-English word list aligned by unit was used for learning the translation model and the chunking model and for generating the lattice of conversion candidates. The alignment was done by semi-automatically. A romanized katakana character usually corresponds to one or several English letters or strings. For example, a romanized katakana character &amp;quot;k&amp;quot; usually corresponds to an English letter &amp;quot;c,&amp;quot; &amp;quot;k,&amp;quot; &amp;quot;ch,&amp;quot; or &amp;quot;q.&amp;quot; With such heuristic rules, the Japanese-English word corpus could be aligned by unit and the alignment errors were corrected manually.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="12" end_page="12" type="metho">
    <SectionTitle>
3 Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
3.1 Learning data and test data
</SectionTitle>
      <Paragraph position="0"> We conducted an experiment on back transliteration using English personal names. The learning data used in the experiment are described below.</Paragraph>
      <Paragraph position="1"> The Dictionary of Western Names of 80,000</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
People
</SectionTitle>
      <Paragraph position="0"> was used as the source of the Japanese-English word corpus. We chose the names in alphabet from A to Z and their corresponding katakana. The number of distinct words was 39,830 for English words and 39,562 for katakana words.</Paragraph>
      <Paragraph position="1"> The number of English-katakana pairs was  . We related the alphabet and katakana character units in those words by using the method described in section 2.6. We then used the corpus to make the translation and the chunking models and to generate a lattice of conversion candidates.</Paragraph>
      <Paragraph position="2"> The learning of the language model was carried out using a word list that was created by merging two word lists: an American personal- null , and English head words of the Dictionary of Western Names of 80,000 people. The American name list contains frequency information for each name; we also used the frequency data for the learning of the language model. A test set for evaluating the value of the Gaussian variance was created using the American name list. The list was split 9:1, and we used the larger data for learning and the smaller data for evaluating the parameter value.</Paragraph>
      <Paragraph position="3"> The test data is as follows. The test data contained 333 katakana name words of American Cabinet officials, and other high-ranking officials, as well as high-ranking governmental officials of Canada, the United Kingdom, Australia, and New Zealand (listed in the World Yearbook 2002 published by Kyodo News in Japan). The English name words that were listed along with the corresponding katakana names were used as answer words. Words that included characters other than the letters A to Z were excluded from the test data. Family names and First names were not distinguished.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
3.2 Experimental models
</SectionTitle>
      <Paragraph position="0"> We used the following methods to test the individual effects of each factor of our method.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>