<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1176">
  <Title>Automatic Construction of Japanese KATAKANA Variant List from Large Corpus</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> &quot;Loan words&quot; in Japanese are usually written in KATAKANA, a phonogram-type Japanese character set. Because loan words are transliterated, their spelling varies, and a KATAKANA word can have several different orthographies for the same original word. For example, we found at least six different spellings of &quot;spaghetti&quot; in 38 years of Japanese newspaper articles. Such variation causes problems for search engines, question answering systems, and similar applications (Yamamoto et al., 2003). For example, when we input one spelling as a query for a search engine or a question answering system, we may fail to find the web pages or answers we are looking for if they use a different orthography of the same word.</Paragraph>
    <Paragraph position="1"> We investigated how many documents Google retrieved when each Japanese KATAKANA variant of &quot;spaghetti&quot; was used as a query. The results are shown in Table 1.</Paragraph>
    <Paragraph position="2"> For example, when we input one of the variants as a Google query, 104,000 documents were retrieved, which is 34.6% (104,000 divided by 300,556). Table 1 shows that each of the six variants appears frequently, so a query using any single spelling may miss the web pages we are looking for.</Paragraph>
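The percentage quoted above can be re-derived from the two figures given in the text. The following is only a minimal check; the variable names are ours and do not come from the paper.

```python
# Figures quoted in the paragraph above; the names are illustrative.
retrieved_for_variant = 104_000  # documents retrieved for the example variant
retrieved_denominator = 300_556  # denominator quoted in the text

share = 100 * retrieved_for_variant / retrieved_denominator
print(f"{share:.1f}%")  # prints 34.6%
```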
    <Paragraph position="3"> Although a Japanese KATAKANA variant list can be created manually, doing so is labor-intensive. To solve this problem, we propose an automatic method for constructing a Japanese KATAKANA variant list from a large corpus.</Paragraph>
    <Paragraph position="4"> Table 1: Number of documents retrieved when we input each Japanese KATAKANA variant of &quot;spaghetti&quot; as a Google query. Columns: Variant, # of retrieved documents.</Paragraph>
    <Paragraph position="6"> Our method consists of three steps. First, we collect Japanese KATAKANA words from a large corpus. Then, we collect candidate pairs of KATAKANA variants from the collected words based on spelling similarity.</Paragraph>
    <Paragraph position="8"> Finally, we select variant pairs using a semantic similarity based on a vector space model of the context of each KATAKANA word.</Paragraph>
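The three steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the method of Section 3: plain Levenshtein distance stands in for the spelling similarity, sentence-level bag-of-words counts stand in for the context vectors, and the thresholds, function names, and toy sentences are all hypothetical. The input is assumed to be already tokenized.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumption: a KATAKANA word is a token made only of KATAKANA letters
# and the long-vowel mark (U+30FC).
KATAKANA = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def collect_katakana_words(sentences):
    """Step 1: collect KATAKANA tokens with a bag-of-words context for each."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if KATAKANA.fullmatch(token):
                # Assumed context: every other token in the same sentence.
                contexts[token].update(tokens[:i] + tokens[i + 1:])
    return contexts


def edit_distance(a, b):
    """Plain Levenshtein distance (a stand-in for the spelling similarity)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_variant_pairs(sentences, max_distance=1, min_cosine=0.5):
    """Run the three steps; both thresholds are hypothetical defaults."""
    contexts = collect_katakana_words(sentences)
    pairs = []
    for a, b in combinations(sorted(contexts), 2):
        # Step 2: keep only pairs whose spellings are close.
        if edit_distance(a, b) > max_distance:
            continue
        # Step 3: keep only pairs whose contexts are similar.
        if cosine(contexts[a], contexts[b]) >= min_cosine:
            pairs.append((a, b))
    return pairs


# Toy, pre-tokenized input (illustrative only, not taken from the Corpus).
sentences = [
    ["スパゲティ", "を", "食べる"],
    ["スパゲッティ", "を", "食べる"],
    ["スパゲッティ", "を", "ゆでる"],
    ["ピアノ", "を", "弾く"],
]
print(find_variant_pairs(sentences))
```

Comparing every pair of words is quadratic in the vocabulary size, so a corpus-scale run would need some form of candidate pruning before step 2; the sketch ignores this for brevity.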
    <Paragraph position="9"> This paper is organized as follows. Section 2 describes related work. Section 3 presents our method for constructing a Japanese KATAKANA variant list from a large corpus. Section 4 shows experimental results using 38 years of Japanese newspaper articles, which we call &quot;the Corpus&quot; from now on, followed by evaluation and discussion. Section 5 describes future work.</Paragraph>
    <Paragraph position="10"> Section 6 offers some concluding remarks.</Paragraph>
  </Section>
</Paper>