<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1176">
  <Title>Automatic Construction of Japanese KATAKANA Variant List from Large Corpus</Title>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3 Construct Japanese KATAKANA Variant List from Large Corpus
</SectionTitle>
    <Paragraph position="0"> Our method consists of the following three steps.</Paragraph>
    <Paragraph position="1">  1. Collect Japanese KATAKANA words from large corpus.</Paragraph>
    <Paragraph position="2"> 2. Collect candidate pairs of KATAKANA variants from the collected KATAKANA words using a spelling similarity. 3. Select variant pairs from the candidate  pairs based on a semantic similarity.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Collect KATAKANA Words from
Large Corpus
</SectionTitle>
      <Paragraph position="0"> At the first step, we collected Japanese KATAKANA words which consist of a KATAKANA character,~(bullet), (macron-1), and(macron-2), which are commonly used as a part of KATAKANA words, using pattern matching. For example, our system collects three KATAKANA words &amp;quot;  ropean countries give up their controlling concepts and pursue the economic deregulation which Ludwig Erhard-2 of West Germany did in 1948, they may achieve the miraculous revival like West Germany.)</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Spelling Similarity
</SectionTitle>
      <Paragraph position="0"> At the second step, our system collects candidate pairs of two KATAKANA words, which are similar in spelling, from the collected KATAKANA words described in Section 3.1.</Paragraph>
      <Paragraph position="1"> We used &amp;quot;string penalty&amp;quot; to collect candidate pairs. String penalty is based on the edit distance (Hall and DOWLING, 1980) which is a similarity measure between two strings. We used the following three types of operations.</Paragraph>
      <Paragraph position="2">  * Substitution Replace a character with another character. null * Deletion Delete a character.</Paragraph>
      <Paragraph position="3"> * Insertion Insert a character.</Paragraph>
      <Paragraph position="4"> We also added some scoring heuristics to the operations based on a pronunciation similarity between characters. The rules are tuned by hand using randomly selected training data. Some examples are shown in Table 3. Here, &amp;quot;-&amp;quot; represents &amp;quot;substitution&amp;quot; and lines without - represent &amp;quot;deletion&amp;quot; or &amp;quot;insertion.&amp;quot; Note that &amp;quot;Penalty&amp;quot; represents a score of the string penalty from now on.</Paragraph>
      <Paragraph position="5"> For example, we give penalty 1 between &amp;quot; &amp;quot;and&amp;quot;,&amp;quot; because the strings become the same when we replace &amp;quot;&amp;quot;with&amp;quot; &amp;quot; and its penalty is 1 as shown in Table 3. Rules Penalty (a) -(small a) 1 (zi) -(di) 1 (macron) 1 (ha) -(ba) 2 (u) -(vu) 2 (a) -(ya) 3 (tsu) -(small tsu) 3  We analyzed hundreds of candidate pairs of training data and figured out that most KATAKANA variations occur when the string penalties were less than a certain threshold. In this paper, we set 4 for the threshold and regard KATAKANA pairs as candidate pairs when the string penalties are less than 4. The threshold was tuned by hand using randomly selected training data.</Paragraph>
      <Paragraph position="6"> For example, from the collected KATAKANA words described in Section 3.1, our system collects the pair of~and ~, since the string penalty is 3.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Context Similarity
</SectionTitle>
      <Paragraph position="0"> At the final step, our system selects variant pairs from the candidate pairs described in Section 3.2 based on a semantic similarity. We used a vector space model as a semantic similarity.</Paragraph>
      <Paragraph position="1"> In the vector space model, we treated 10 randomly selected articles from the Corpus as a context of each KATAKANA word.</Paragraph>
      <Paragraph position="2"> We divided sentences of the articles into words using JUMAN  1999) which is the Japanese morphological analyzer, and then extracted content words which consist of nouns, verbs, adjectives, adverbs, and unknown words except stopwords. Stopwords are composed of Japanese HIRAGANA characters, punctuations, numerals, common words, and so on.</Paragraph>
      <Paragraph position="3"> We used a cosine measure to calculate a semantic similarity of two KATAKANA words.</Paragraph>
      <Paragraph position="4"> Suppose that one KATAKANA word makes a context vector a and the other one makes b.</Paragraph>
      <Paragraph position="5"> The semantic similarity between two vectors a and b is calculated as follows.</Paragraph>
      <Paragraph position="7"> The cosine measure tends to overscore frequently appeared words. Therefore, in order to avoid the problem, we treated log(N +1)asa score of a word appeared in a context. Here, N represents the frequency of a word in a context.</Paragraph>
      <Paragraph position="8"> We set 0.05 for the threshold of the semantic similarity, i.e. we regard candidate pairs as variant pairs when the semantic similarities are more than 0.05. The threshold was tuned by hand using randomly selected training data.</Paragraph>
      <Paragraph position="9"> In the case of &amp;quot;~(Ludwig Erhard-1)&amp;quot; and &amp;quot;~ (Ludwig Erhard-2)&amp;quot;, the semantic similarity becomes 0.17 as shown in Table 4. Therefore, we regard them as a variant pair.</Paragraph>
      <Paragraph position="10"> Note that in Table 4, a decimal number represents a score of a word appeared in a context calculated by log(N+1). For example, the score  of{(miracle) in the first context is 0.7.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Data Preprocessing and
Performance Measures
</SectionTitle>
      <Paragraph position="0"> We conducted the experiments using the Corpus. The number of documents in the Cor-</Paragraph>
      <Paragraph position="2"> pus was 4,678,040 and the distinct number of KATAKANA words in the Corpus was 1,102,108.</Paragraph>
      <Paragraph position="3"> As for a test set, we collected candidate pairs whose string penalties range from 1 to 12. The number of collected candidate pairs was 2,590,240. In order to create sample correct KATAKANA variant data, 500 out of 2,590,240 were randomly selected and we evaluated them manually by checking their contexts. Through the evaluation, we found that no correct variant pairs appeared from 10 to 12. Thus, we think that treating candidate pairs whose string penalties range from 1 to 12 can cover almost all of correct variant pairs.</Paragraph>
      <Paragraph position="4"> To evaluate our method, we used recall (Re), precision (Pr), and F measure (F). These performance measures are calculated by the following formulas:</Paragraph>
      <Paragraph position="6"> number of pairs found and correct total number of pairscorrect</Paragraph>
      <Paragraph position="8"> number of pairs found and correct total number of pairs found</Paragraph>
      <Paragraph position="10"/>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Experiment-1
</SectionTitle>
      <Paragraph position="0"> We conducted the first experiment based on two settings; one method uses only the spelling similarity and the other method uses both the spelling similarity and the semantic similarity.</Paragraph>
      <Paragraph position="1"> Henceforth, we use &amp;quot;Method  Ext: The number of extracted candidate pairs Cor: The number of correct variant pairs among the extracted candidate pairs Note that in Method p&amp;s , we ignored candidate pairs whose string penalties ranged from 4 to 12, since we set 4 for the threshold of the string penalty as described in Section 3.2. The result is shown in Table 5. For example, when the penalty was 2, 81 out of 117 were selected as correct variant pairs in Method p and the precision was 69.2%. Also, 80 out of 98 were selected as correct variant pairs in Method p&amp;s and the precision was 81.6%.</Paragraph>
      <Paragraph position="2"> As for Penalty 1-12 of Method p , i.e. we focused on the string penalties between 1 and 12, the recall was 100%, because we regarded 269 out of 500 as correct variant pairs and Method p extracted all of them. Also, the precision was 53.8%, calculated by 269 divided by 500. Comparing Method p&amp;s to Method p , the recall and the precision of Method p&amp;s were well-balanced, since the recall was 97.4% and the precision was 89.1%.</Paragraph>
      <Paragraph position="3"> In the same way, for Penalty 1-3, i.e. the string penalties between 1 and 3, the recall of</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Method
</SectionTitle>
      <Paragraph position="0"> p was 98.1%, since five correct variant pairs between 4 and 12 were ignored and the remaining 264 out of 269 were found. The precision of Method p was 77.2%. It was 23.4% higher than the one of Penalty 1-12. Thus, F measure also improved 16.4%. This result indicates that setting 4 for the threshold works well to improve overall performance.</Paragraph>
      <Paragraph position="1">  variant pairs when the penalties were 1 and 2. However, the precision of Method p&amp;s was 16.2% higher. Thus, F measure of Method p&amp;s  improved 6.7% compared to the one of Method p .</Paragraph>
      <Paragraph position="2"> From this result, we think that taking the semantic similarity into account is a better strategy to construct Japanese KATAKANA variant list.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.3 Experiment-2
</SectionTitle>
      <Paragraph position="0"> We investigated how many variant pairs were extracted in the case of six different spellings of &amp;quot;spaghetti&amp;quot; described in Section 1. Table 6 shows the result of all combination pairs when we applied Method p&amp;s .</Paragraph>
      <Paragraph position="1"> For example, when the penalty was 1,</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Method
</SectionTitle>
      <Paragraph position="0"> p&amp;s selected seven candidate pairs and all of them were correct. Thus, the recall was 100%. From Table 6, we see that the string penalties of all combination pairs ranged from 1 to 3 and our system selected all of them by the semantic similarity.</Paragraph>
    </Section>
    <Section position="6" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.4 Estimation of Expected Correct Variant Pairs
</SectionTitle>
      <Paragraph position="0"> We estimated how many correct variant pairs could be selected from the Corpus, based on the precision of Method p&amp;s as shown in Table 5.</Paragraph>
      <Paragraph position="1"> The result is shown in Table 7. We find that the number of candidate pairs in the Corpus was 100,746 for the penalty of 1, and 56,569 for the penalty of 2, and 40,004 for the penalty of 3.</Paragraph>
      <Paragraph position="2"> For example, when the penalty was 2, we estimate that 46,178 out of 56,569 could be selected as correct variant pairs, since the precision was 81.6% as shown in Table 5. In total, we estimate that 178,569 out of 197,319 could be selected as correct variant pairs from the Corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4.5 Error Analysis-1
</SectionTitle>
    <Paragraph position="0"> As shown in Table 5, our system couldn't select two correct variant pairs using semantic similarity when the penalties were 1 and 2. We investigated the reason from the training data.</Paragraph>
    <Paragraph position="1"> The problem was caused because the contexts of the pairs were diffrent. For example, in the case of &amp;quot;&amp;quot;and&amp;quot;~ ,&amp;quot; which represent the same building material company &amp;quot;Aroc Sanwa&amp;quot; of Fukui prefecture in Japan, their contexts were completely different because of the following reason.</Paragraph>
    <Paragraph position="2"> *,~ (Aroc Sanwa): This word appeared with the name of an athlete who took part in the national athletic meet held in Toyama prefecture in Japan, and the company sponsored the athlete.</Paragraph>
    <Paragraph position="3"> ~(Aroc~Sanwa): This word was used to introduce the company in the article.</Paragraph>
    <Paragraph position="4"> Note that each context of these words was composed of only one article.</Paragraph>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4.6 Error Analysis-2
</SectionTitle>
    <Paragraph position="0">  FromTable5,weseethatthenumbersofincorrect variant pairs selected by Method p&amp;s were 18 and 14 for each penalty of 2 and 3. We investigated such cases in the training data. The example of &amp;quot;(Cart, Kart)&amp;quot; and &amp;quot; (Card)&amp;quot; is shown as follows.</Paragraph>
    <Paragraph position="1"> *, (Cart, Kart): This word was used as the abbreviation of &amp;quot;Shopping Cart,&amp;quot; &amp;quot;Racing Kart,&amp;quot; or &amp;quot;Sport Kart.&amp;quot; (Card): This word was used as the abbreviation of &amp;quot;Credit Card&amp;quot; or &amp;quot;Cash Card&amp;quot; and was also used as the meaning of &amp;quot;Schedule of Games.&amp;quot; Although these were not a variant pair, our system regarded the pair as the variant pair, because their contexts were similar. In both contexts, &amp;quot;b;(utilization),&amp;quot; &amp;quot;G(record),&amp;quot; &amp;quot;l(guest),&amp;quot; &amp;quot;b(aim),&amp;quot; &amp;quot;(team),&amp;quot; &amp;quot;(victory),&amp;quot; &amp;quot;M(high, expensive),&amp;quot; &amp;quot; (success),&amp;quot; &amp;quot;Z(entry),&amp;quot; and so on were appeared frequently and therefore the semantic similarity became high.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML