<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2134">
  <Title>Document Classification Using Domain Specific Kanji Characters Extracted by X2 Method</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Document classification has been widely investigated for assigning domains to documents for text retrieval, or aiding human editors in assigning such domains.</Paragraph>
    <Paragraph position="1"> Various successful systems have been developed to classify text documents (Blosseville, 1992; Guthrie, 1994; Hamill, 1980; Masand, 1992; Young, 1985).</Paragraph>
    <Paragraph position="2"> Conventional approaches to developing document classification systems fall into two groups:  1. semantic approach 2. statistical approach  In the semantic approach, document classification is based on the words and keywords of a thesaurus. If the thesaurus is well constructed, a high score is achieved. However, this approach has disadvantages in terms of development and maintenance. In the statistical approach, on the other hand, a human expert classifies a sample set of documents into predefined domains, and the computer learns from these samples how to classify documents into these domains. This approach offers advantages in terms of development and maintenance, but the quality of its results is not as good as that of the semantic approach. In either approach, document classification using words has the following problems: 1. Words in the documents must be normalized to match those in the dictionary and the thesaurus. Moreover, in the case of Japanese texts, it is difficult to extract words, because the texts are written without blank spaces as delimiters and must be segmented into words.</Paragraph>
    <Paragraph position="3">  2. A simple word extraction technique generates too many words. In the statistical approach,  the dimensions of the training space are too large and the classification process usually fails. Therefore, Japanese document classification based on words needs a high-precision Japanese morphological analyzer and a great amount of lexical knowledge. Considering these disadvantages, we propose a new method of document classification based on kanji characters, in which document classification is performed without a morphological analyzer or lexical knowledge. In our approach, we extracted domain-specific kanji characters for document classification by the χ² method. The features of documents and domains are represented in a feature space whose axes are these domain-specific kanji characters. Then, we classified Japanese documents into domains by measuring the similarity between new documents and the domains in the feature space.</Paragraph>
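    <Paragraph position="4"> The three steps outlined above (scoring kanji, building a feature space, and measuring similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: this section does not give the exact chi-square formula or the similarity measure, so the sketch assumes the common observed-versus-expected form of the statistic, cosine similarity, and the CJK Unified Ideographs block as the definition of a kanji. All function names are hypothetical.

```python
from collections import Counter
import math


def kanji_counts(text):
    """Count kanji in text (assumption: kanji = CJK Unified Ideographs block)."""
    return Counter(ch for ch in text if ord(ch) in range(0x4E00, 0xA000))


def chi_square_scores(domain_texts):
    """Score each kanji by summing (observed - expected)^2 / expected over domains."""
    counts = {d: kanji_counts(t) for d, t in domain_texts.items()}
    domain_totals = {d: sum(c.values()) for d, c in counts.items()}
    grand_total = sum(domain_totals.values())
    kanji_totals = Counter()
    for c in counts.values():
        kanji_totals.update(c)
    scores = {}
    for k, k_total in kanji_totals.items():
        score = 0.0
        for d, c in counts.items():
            # Expected count if the kanji were spread evenly across domains.
            expected = k_total * domain_totals[d] / grand_total
            if expected:
                score += (c[k] - expected) ** 2 / expected
        scores[k] = score
    return scores


def select_kanji(domain_texts, top_n):
    """Pick the top_n highest-scoring kanji as axes of the feature space."""
    scores = chi_square_scores(domain_texts)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def to_vector(text, axes):
    """Represent a text as kanji frequencies along the chosen axes."""
    counts = kanji_counts(text)
    return [counts[k] for k in axes]


def cosine(u, v):
    """Cosine similarity (assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, domain_texts, axes):
    """Return the domain whose training text is most similar to the new text."""
    doc = to_vector(text, axes)
    return max(domain_texts, key=lambda d: cosine(doc, to_vector(domain_texts[d], axes)))
```

 Here domain_texts is assumed to map each domain name to the concatenated text of its training documents. The sketch involves no word segmentation at any point, which is the property the character-based approach relies on.</Paragraph>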
  </Section>
</Paper>