File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/c04-1176_abstr.xml
Size: 1,355 bytes
Last Modified: 2025-10-06 13:43:24
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1176"> <Title>Automatic Construction of Japanese KATAKANA Variant List from Large Corpus</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper presents a method for constructing a list of Japanese KATAKANA variants from a large corpus. Our method is useful for information retrieval, information extraction, question answering, and similar tasks, because KATAKANA words tend to be used as &quot;loan words&quot; and transliteration produces several spelling variations. Our method consists of three steps. In step 1, our system collects KATAKANA words from a large corpus. In step 2, our system collects candidate pairs of KATAKANA variants from the collected words using a spelling similarity based on edit distance. In step 3, our system selects variant pairs from the candidate pairs using a semantic similarity calculated with a vector space model of the context of each KATAKANA word. We conducted experiments using 38 years of Japanese newspaper articles and constructed a Japanese KATAKANA variant list with 97.4% recall and 89.1% precision. Estimating from this precision, our system can extract 178,569 variant pairs from the corpus.</Paragraph> </Section> </Paper>
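
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact spelling-similarity formula, the context weighting, or any thresholds, so the plain Levenshtein distance, the raw co-occurrence counts, the function names (collect_katakana, edit_distance, cosine, variant_pairs), and the parameters max_dist and min_sim are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Assumption: a KATAKANA word is a run of characters from the Unicode
# Katakana block (U+30A1..U+30F6) plus the long-vowel mark (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30F6\u30FC]+")


def collect_katakana(text: str) -> list[str]:
    """Step 1 (sketch): pull KATAKANA runs out of raw text."""
    return KATAKANA_RUN.findall(text)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (unit costs), used here as a stand-in
    for the paper's edit-distance-based spelling similarity."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def variant_pairs(contexts: dict[str, Counter],
                  max_dist: int = 1,       # hypothetical threshold
                  min_sim: float = 0.5):   # hypothetical threshold
    """Steps 2 and 3 (sketch): keep pairs that are close in spelling
    and whose context vectors are similar."""
    pairs = []
    for w1, w2 in combinations(sorted(contexts), 2):
        if edit_distance(w1, w2) > max_dist:
            continue  # step 2: not a spelling candidate
        if cosine(contexts[w1], contexts[w2]) >= min_sim:
            pairs.append((w1, w2))  # step 3: contexts agree
    return pairs


if __name__ == "__main__":
    # Toy context counts, invented purely for this example.
    contexts = {
        "コンピュータ": Counter({"software": 3, "network": 2}),
        "コンピューター": Counter({"software": 2, "network": 3}),
        "カメラ": Counter({"lens": 4, "photo": 1}),
        "カメオ": Counter({"actor": 3, "film": 2}),
    }
    print(collect_katakana("新しいコンピュータとカメラ"))
    print(variant_pairs(contexts))
```

In this toy data, both pairs pass the spelling filter with an edit distance of 1, but only the pair with overlapping contexts survives the semantic filter. Comparing all pairs is quadratic in the vocabulary size, so a corpus of the scale mentioned in the abstract would need some form of candidate indexing that this sketch leaves out.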