
<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1024">
  <Title>Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the gold standards and evaluation measures for evaluating the effectiveness of the above method for back-transliterating Japanese names.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Gold Standards
</SectionTitle>
      <Paragraph position="0"> Based on two publicly accessible name lists and a Japanese-to-English name lexicon, we have constructed two Gold Standards. The Japanese-to-English name lexicon is ENAMDICT 11 , which contains more than 210,000 Japanese-English name translation pairs.</Paragraph>
      <Paragraph position="1"> Gold Standard - Given Names (GS-GN): to construct a gold standard for Japanese given names, we obtained 7,151 baby names in romanji from http://www.kabalarians.com/. Of these 7,151 names, 5,115 names have kanji translations in the ENAMDICT12. We took the 5115 romanji names and their kanji translations in the ENAMDICT as the gold standard for given names.</Paragraph>
      <Paragraph position="2"> Gold Standard - Surnames (GS-SN): to construct a gold standard for Japanese surnames, we downloaded 972 surnames in romanji from http://business.baylor.edu/Phil_VanAuken/Japanes eSurnames.html. Of these names, 811 names have kanji translations in the ENAMDICT. We took these 811 romanji surnames and their kanji translations in the ENAMDICT as the gold standard for Japanese surnames.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation Measures
</SectionTitle>
      <Paragraph position="0"> Each name in romanji in the gold standards has at least one kanji representation obtained from the ENAMDICT. For each name, precision, recall, and F measures are calculated as follows:  * Precision: number of correct kanji output / total number of kanji output * Recall: number of correct kanji output / total number of kanji names in gold standard * F-measure: 2*Precision*Recall / (Precision +</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation Results and Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Effectiveness of Corpus Validation
</SectionTitle>
      <Paragraph position="0"> missing from ENAMDICT is a further justification for a name translation method as described in this paper.</Paragraph>
      <Paragraph position="1"> GS-SN, respectively. For given names, corpus validation produces the best average precision of 0.45, while the best average recall is a low 0.27.</Paragraph>
      <Paragraph position="2"> With the additional step of Web validation of the romanji-kanji combinations, the average precision increased by 62.2% to 0.73, while the best average recall improved by 7.4% to 0.29. We observe a similar trend for surnames. The results demonstrate that, through a large, mixed-lingual corpus such as the Web, we can improve both precision and recall for automatically transliterating romanji names back to kanji.</Paragraph>
      <Paragraph position="3">  and Avg F statistics achieved through corpus validation and Web validation for GS-SN.</Paragraph>
      <Paragraph position="4"> We also observe that the performance statistics for the surnames are significantly higher than those of the given names, which might reflect the different degrees of flexibility in using surnames and given names in Japanese. We would expect that the surnames form a somewhat closed set, while the given names belong to a more open set.</Paragraph>
      <Paragraph position="5"> This may account for the higher recall for surnames.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Effectiveness of Multi-Step Filtering
</SectionTitle>
      <Paragraph position="0"> If the big, mixed-lingual Web can deliver better validation than the limited-sized monolingual corpus, why not use it at every stage of filtering? Technically, we could use the Web as the ultimate corpus for validation at any stage when a corpus is required. In practice, however, each Web access involves additional computation time for file IO, network connections, etc. For example, accessing Google took about 2 seconds per name13; gathering 13 We inserted a 1 second sleep between calls to the search engine so as not to overload the engine.</Paragraph>
      <Paragraph position="1"> statistics for about 30,000 kanji-romanji combinations14 took us around 15 hours.</Paragraph>
      <Paragraph position="2"> In the procedure described in section 3.2, we have aimed to reduce computation complexity and time at several stages. In step 2, we use bigram-based language model from a corpus to reduce the hypothesis space. In step 3, we use corpus filtering to obtain a fast validation of the candidates, before passing the output to the Web validation in step 4.</Paragraph>
      <Paragraph position="3"> Table 4 illustrates the savings achieved through these steps.</Paragraph>
      <Paragraph position="4">  each step to be passed to the next step. The percentages specify the amount of reduction in hypothesis space.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Thresholding Effects
</SectionTitle>
      <Paragraph position="0"> We have examined whether we should discard the validated candidates with low frequencies either from the corpus or the Web. The cutoff points examined include initial low frequency range 1 to 10 and then from 10 up to 400 in with increments of 5. Figure 1 and Figure 2 illustrate that, to achieve best overall performance, it is beneficial to discard candidates with very low frequencies, e.g., frequencies below 5. Even though we observe a stabling trend after reaching certain threshold points for these validation methods, it is surprising to see that, for the corpus validation method with GS-GN, with stricter thresholds, average precisions are actually decreasing. We are currently investigating this exception.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Error Analysis
</SectionTitle>
      <Paragraph position="0"> Based on a preliminary error analysis, we have identified three areas for improvements.</Paragraph>
      <Paragraph position="1"> First, our current method does not account for certain phonological transformations when the On/Kun readings are concatenated together.</Paragraph>
      <Paragraph position="2"> Consider the name &amp;quot;matsuda&amp;quot; ( a0a2a1 ). The segmentation step correctly segmented the romanji to &amp;quot;matsu-da&amp;quot;. However, in the Unihan database, 14 At this rate, checking the 21 million combinations remaining after filtering with bigrams using the Web (without the corpus filtering step) would take more than a year.</Paragraph>
      <Paragraph position="3"> the Kun reading of a1 is &amp;quot;ta&amp;quot;, while its On reading is &amp;quot;den&amp;quot;. Therefore, using the mappings from the Unihan database, we failed to obtain the mapping between the pronunciation &amp;quot;da&amp;quot; and the kanji a1 , which resulted in both low precision and recall for &amp;quot;matsuda&amp;quot;. This suggests for introducing language-specific phonological transformations or alternatively fuzzy matching to deal with the mismatch problem.</Paragraph>
      <Paragraph position="4">  corpus and corpus+Web validation with different frequency-based cutoff thresholds for GS-SN Second, ENAMDICT contains mappings between kanji and romanji that are not available from the Unihan database. For example, for the name &amp;quot;hiroshi&amp;quot; in romanji, based on the mappings from the Unihan database, we can obtain two possible segmentations: &amp;quot;hiro-shi&amp;quot; and &amp;quot;hi-ro-shi&amp;quot;. Our method produces two- and three-kanji character sequences that correspond to these romanji characters. For example, corpus validation produces the following kanji candidates for  This suggests the limitation of relying solely on the Unihan database for building mappings between romanji characters and kanji characters.</Paragraph>
      <Paragraph position="5"> Other mapping resources, such as ENAMDCIT, should be considered in our future work.</Paragraph>
      <Paragraph position="6"> Third, because the statistical part-of-speech tagger we used for Japanese term identification does not have a lexicon of all possible names in Japanese, some unknown names, which are incorrectly separated into individual kanji characters, are therefore not available for correct corpus-based validation. We are currently exploring methods using overlapping character bigrams, instead of the tagger-produced terms, as the basis for corpus-based validation and filtering.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML