<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2134">
  <Title>Document Classification Using Domain Specific Kanji Characters Extracted by X2 Method</Title>
  <Section position="3" start_page="0" end_page="794" type="metho">
    <SectionTitle>
2 Document Classification on
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="794" type="sub_section">
      <SectionTitle>
Domain Specific Kanji Characters
2.1 Text Representation by Kanji
Characters
</SectionTitle>
      <Paragraph position="0"> In previous research, texts were represented by significant words, and a word was regarded as a minimum semantic unit. But a word is not a minimum semantic unit, because a word consists of one or more morphemes. Here, we propose text representation by morphemes. We have applied this idea to Japanese text representation, where a kanji character is a morpheme. Each kanji character has its meaning, and Japanese words (nouns, verbs, adjectives, and so on) usually contain one or more kanji characters which represent the meaning of the words to some extent.</Paragraph>
      <Paragraph position="1"> When representing the features of a text by kanji characters, it is important to consider which kanji characters are significant for the text representation and useful for classification. We assumed that these significant kanji characters appear more frequently in one domain than in the others, and extracted them by the X 2 method. From now on, these kanji characters are called the domain specific kanji characters. Then, we represented the content of a Japanese text x as the following vector of domain specific kanji characters: x = (f1, f2, ..., fl) (1)</Paragraph>
      <Paragraph position="3"> where component fi is the frequency of domain specific kanji i and l is the number of all the kanji characters extracted by the X 2 method. In this way, the Japanese text x is expressed as a point in the l-</Paragraph>
      <Paragraph position="4"> dimensional feature space the axes of which are the domain specific kanji characters. Then, we used this feature space for representing the features of the domains. Namely, the domain i is represented using the feature vector of domain specific kanji characters as follows: vi = (fi1, fi2, ..., fil) (2)</Paragraph>
      <Paragraph position="6"> We used this feature space not only for the text representation but also for the document classification. If the document classification is performed on kanji characters, we may avoid the two problems described in Section 1.</Paragraph>
      <Paragraph position="7">  1. It is simpler to extract kanji characters than to extract Japanese words.</Paragraph>
      <Paragraph position="8"> 2. There are about 2,000 kanji characters that are considered necessary for general literacy. So, the maximum number of dimensions of the training space is about 2,000.</Paragraph>
      <Paragraph position="9"> Of course, in our approach, the quality of the results may not be as good as in the previous approaches using words. But it is significant that we can avoid the cost of morphological analysis, which is not so perfect.</Paragraph>
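As an illustration of the kanji-based representation above (a minimal sketch, not the authors' code; the kanji inventory and the sample text are invented):

```python
# Sketch: map a Japanese text to its frequency vector over a fixed
# inventory of domain specific kanji characters, as in vector (1).
from collections import Counter

def kanji_vector(text, specific_kanji):
    """Return (f1, ..., fl): frequency of each domain specific kanji in text."""
    counts = Counter(ch for ch in text if ch in specific_kanji)
    return [counts[k] for k in specific_kanji]

# Hypothetical inventory of l = 3 domain specific kanji characters.
inventory = ["本", "図", "書"]
print(kanji_vector("図書館の本", inventory))  # -> [1, 1, 1]
```

Non-kanji characters (hiragana, punctuation) simply contribute nothing to the vector, which is what makes this cheaper than word segmentation.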
    </Section>
    <Section position="2" start_page="794" end_page="794" type="sub_section">
      <SectionTitle>
2.2 Procedure for the Document
Classification using Kanji Characters
</SectionTitle>
      <Paragraph position="0"> Our approach is the following:  1. A sample set of Japanese texts is classified by a human expert.</Paragraph>
      <Paragraph position="1"> 2. Kanji characters which are distributed unevenly among text domains are extracted by the X 2 method. 3. The feature vectors of the domains are obtained from the information on domain specific kanji characters and their frequency of occurrence. 4. The classification system builds a feature vector of a new document, compares it with the feature vectors of each domain, and determines the domain which the document belongs to. Figure 1 shows the procedure for the document classification using domain specific kanji characters.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="794" end_page="798" type="metho">
    <SectionTitle>
3 Automatic Extraction of Domain
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="794" end_page="795" type="sub_section">
      <SectionTitle>
Specific Kanji Characters
3.1 The Learning Sample
</SectionTitle>
      <Paragraph position="0"> For extracting domain specific kanji characters and obtaining the feature vectors of each domain, we use articles of &quot;Encyclopedia Heibonsha&quot; as the learning sample. The reason why we use this encyclopedia is that it is published in the electronic form and contains a great number of articles. This encyclopedia was written by 6,727 authors, and contains about 80,000 articles, 6.52 x 10^7 characters, and 7.52 x 10^7 kanji characters. An example article of &quot;Encyclopedia Heibonsha&quot; is shown in Figure 7. Unfortunately, the articles are not classified, but there is the author's name at the end of each article and his specialty is noted in the preface. Therefore, we can classify these articles into the authors' specialties automatically.</Paragraph>
      <Paragraph position="1"> The specialties used in the encyclopedia are wide, but they are not well balanced. (For example, the specialty of Yuriko Takeuchi is Anglo-American literature; on the other hand, that of Koichi Amano is science fiction.) Moreover, some domains of the authors' specialties contain only a few</Paragraph>
      <Paragraph position="2"> (Figure: an example article of &quot;Encyclopedia Heibonsha&quot;, showing the title, pronunciation, text, and author fields.) &quot;Encyclopedia Heibonsha&quot; articles. So, it is difficult to extract appropriate domain specific kanji characters from the articles which are classified into the authors' specialties.</Paragraph>
      <Paragraph position="3"> Therefore, it is important to consider that 206 specialties in the encyclopedia, which represent almost half of the specialties, are used as the subjects of the domains in the Nippon Decimal Classification (NDC). For example, botany, which is one of the authors' specialties, is also one of the subjects of the domains in the NDC. In addition, the NDC has hierarchical domains. For keeping the domains well balanced, we combined the specialties using the hierarchical relationship of the NDC. The procedure for combining the specialties is as follows: 1. We aligned the specialties to the domains in the NDC. 206 specialties corresponded to the domains of the NDC automatically, and the rest were aligned manually.</Paragraph>
      <Paragraph position="4"> 2. We combined the 418 specialties into 59 code domains of the NDC, using its hierarchical relationship. Table 1 shows an example of the hierarchical relationship of the NDC.</Paragraph>
      <Paragraph position="5"> However, the 59 domains are not well balanced. For example, &quot;physics&quot;, &quot;electric engineering&quot;, and &quot;German literature&quot; are all code domains of the NDC, and we know by intuition that these domains are not well balanced. So, for keeping the domains well balanced, we manually combined the 59 domains into 42.</Paragraph>
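The decimal hierarchy that makes this combination possible can be sketched as follows (a hypothetical helper, not the authors' code; the codes are those of Table 1):

```python
# Sketch: generalize an NDC decimal code to a coarser level of its
# hierarchy by truncating digits (level 1 = class, 2 = code, 3 = item).
def ndc_ancestor(code, level):
    digits = code.replace(".", "")  # drop the decimal point first
    return digits[:level]

print(ndc_ancestor("548.23", 2))  # -> "54"  (electrical engineering, code level)
print(ndc_ancestor("548.23", 3))  # -> "548" (information engineering, item level)
```

Truncating every specialty's code to the code level is what collapses the 418 specialties into the 59 code domains mentioned above.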
    </Section>
    <Section position="2" start_page="795" end_page="795" type="sub_section">
      <SectionTitle>
3.2 Selection of Domain Specific Kanji
Characters by the X 2 Method
</SectionTitle>
      <Paragraph position="0"> Using the value X 2 of the X 2 test, we can detect the unevenly distributed kanji characters and extract them as domain specific kanji characters. Indeed, it was verified that the X 2 method is useful for extracting keywords, rather than kanji characters (Nagao, 1976).</Paragraph>
      <Paragraph position="1"> Suppose we denote the frequency of kanji i in the domain j as mij, and we assume that kanji i is distributed evenly. Then the value X 2 of kanji i, Xi2, is expressed by the equations as follows: Xi2 = sum_{j=1}^{l} (mij - eij)^2 / eij (3), Xij2 = (mij - eij)^2 / eij (4), where eij = (sum_{i=1}^{k} mij)(sum_{j=1}^{l} mij) / (sum_{i=1}^{k} sum_{j=1}^{l} mij) is the expected frequency of kanji i in domain j under the even distribution,</Paragraph>
      <Paragraph position="3"> where k is the number of varieties of the kanji characters and l is the number of the domains. If the value Xi2 is relatively big, we consider that the kanji i is distributed unevenly.</Paragraph>
      <Paragraph position="4"> There are two considerations about the extraction of the domain specific kanji characters using the X 2 method. The first is the size of the training samples. If the size of each training sample is different, the ranking of domain specific kanji characters is not equal to the ranking of the value X 2. The second is that we cannot recognize which domains are represented by the extracted kanji characters using only the value Xi2 of equation (3). In other words, there is no guarantee that we can extract the appropriate domain specific kanji characters from every domain. For this reason, we have extracted a fixed number of domain specific kanji characters from every domain using the ranking of the value Xij2 of equation (4) instead of (3). Not only the value Xi2 of equation (3) but also the value Xij2 of equation (4) becomes big when the kanji i appears more frequently in the domain j than in the others. Table 2 shows the top 20 domain specific kanji characters of the 42 domains. Further, the Appendix shows the meanings of each domain specific kanji character of the &quot;library science&quot; domain.</Paragraph>
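A minimal sketch of this X 2 computation (an assumed implementation of the standard chi-square statistic, not the authors' code; the count matrix is invented), where m[i][j] is the frequency of kanji i in domain j:

```python
# Sketch: per-cell chi-square values X2_ij for a kanji-by-domain count matrix.
def chi2_scores(m):
    k = len(m)                  # number of kanji varieties
    l = len(m[0])               # number of domains
    total = sum(sum(row) for row in m)
    row_sum = [sum(row) for row in m]
    col_sum = [sum(m[i][j] for i in range(k)) for j in range(l)]
    x2 = [[0.0] * l for _ in range(k)]
    for i in range(k):
        for j in range(l):
            # expected count if kanji i were distributed evenly over domains
            e = row_sum[i] * col_sum[j] / total
            x2[i][j] = (m[i][j] - e) ** 2 / e
    return x2  # X2_ij per equation (4); summing over j gives X2_i

# Two kanji, two domains; each kanji concentrated in one domain.
print(chi2_scores([[10, 0], [0, 10]]))  # -> [[5.0, 5.0], [5.0, 5.0]]
```

Ranking kanji per domain by x2[i][j] (keeping only those with m[i][j] above expectation) yields a fixed number of domain specific kanji characters for every domain.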
    </Section>
    <Section position="3" start_page="795" end_page="796" type="sub_section">
      <SectionTitle>
3.3 Feature Space for the Document
Classification
</SectionTitle>
      <Paragraph position="0"> In order to measure the closeness between an unclassified document and the 42 domains, we proposed a feature space the axes of which are the domain specific kanji characters extracted from the 42 domains. To represent the features of an unclassified document and the 42 domains, we used feature vectors (1) and (2) respectively. To find out the closest domain, we measured the angle between the unclassified document and each of the 42 domains in the feature space. If we are given a new document the feature vector of which is x, the classification system can compute the angle θ(vi, x) with each vector vi which represents the domain i, and find the vi with min_i θ(vi, x). Using this procedure, every document is classified into the closest domain.</Paragraph>
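The minimum-angle classification described above can be sketched as follows (an assumed implementation; the domain names and vectors are invented):

```python
# Sketch: classify a document vector x into the domain whose feature
# vector makes the smallest angle with x.
import math

def angle(x, v):
    dot = sum(a * b for a, b in zip(x, v))
    nx = math.sqrt(sum(a * a for a in x))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(dot / (nx * nv))

def closest_domain(x, domain_vectors):
    # min over domains of the angle theta(v_i, x)
    return min(domain_vectors, key=lambda d: angle(x, domain_vectors[d]))

domains = {"botany": [3, 0, 1], "physics": [0, 4, 1]}  # hypothetical vectors
print(closest_domain([2, 0, 1], domains))  # -> botany
```

Minimizing the angle is equivalent to maximizing cosine similarity, so the comparison is insensitive to document length.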
      <Paragraph position="1"> Table 1: Division of the Nippon Decimal Classification - technology/engineering (class)</Paragraph>
      <Paragraph position="2"> 54(0) electrical engineering (code), 548 information engineering (item), 548.2 computers (detailed item), 548.23 memory unit (more detailed item). The NDC is the most popular library classification in Japan and it has hierarchical domains. The NDC has 10 classes. Each class is further divided into 10 codes. Each code is divided into 10 items, which in turn have more detailed items using one or two digits. Each domain is assigned a decimal code.</Paragraph>
    </Section>
    <Section position="4" start_page="796" end_page="797" type="sub_section">
      <SectionTitle>
4.1 Experimental Results
</SectionTitle>
      <Paragraph position="0"> For evaluating our approach, we used the following three sets of articles in our experiments:  1. articles in &quot;Scientific American (in Japanese)&quot; (162 articles) 2. editorial columns in Asahi Newspaper &quot;TENSEI JINGO&quot; (about 2,000 articles) 3. editorial articles in Asahi Newspaper (about</Paragraph>
      <Paragraph position="2"> Because the articles in &quot;Scientific American (in Japanese)&quot; are not classified, we classified them manually. The articles of &quot;TENSEI JINGO&quot; and the editorial articles are classified by editors into a hierarchy of domains which differ from the domains of the NDC. We aligned these domains to the 42 domains described in Section 3.1. Some of these articles contain two or more themes, and these articles are classified into two or more domains by editors. For example, the editorial article &quot;Too Many Katakana Words&quot; is classified into three domains.</Paragraph>
      <Paragraph position="3"> In these cases, we judge that the result of the automatic classification is correct when it corresponds to one of the domains into which the document is classified by editors. Figure 3, Figure 4, and Figure 5 describe the variations of the classification results with respect to the number of domain specific kanji characters.</Paragraph>
    </Section>
    <Section position="5" start_page="797" end_page="798" type="sub_section">
      <SectionTitle>
4.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> In our approach, the maximum correct recognition scores for the editorial articles and the articles in &quot;Scientific American (in Japanese)&quot; are 74 % and 85 %, respectively. Considering that our system uses only the statistical information of kanji characters and deals with a great number of documents which cover various specialties, our approach achieved a good result in document classification. From this, we believe that our approach is efficient for broadly classifying various subjects of documents, e.g.</Paragraph>
      <Paragraph position="1"> news stories. A method for classifying news stories is significant for distributing and retrieving articles in electronic newspapers.</Paragraph>
      <Paragraph position="2"> The maximum recognition score for &quot;TENSEI JINGO&quot; is 47 %. The reasons why this result is far worse than the others are: 1. The style of the documents The style of &quot;TENSEI JINGO&quot; is similar to that of an essay or a novel and it is written in colloquial Japanese. In contrast, the style of the editorial articles and &quot;Scientific American (in Japanese)&quot; is similar to that of a thesis. We think the reason why we achieved the good results in the classification of the editorial articles and &quot;Scientific American (in Japanese)&quot; is that many technical terms are used in them, and it is likely that the kanji characters which represent the technical terms are domain specific kanji characters in that domain.</Paragraph>
      <Paragraph position="3"> 2. Two or more themes in one document Many articles of &quot;TENSEI JINGO&quot; contain two or more themes. In these articles, it is usual that the introductory part has little relation to the main theme. For example, the article &quot;Splendid Retirement&quot;, whose main theme is the resignation of the Speaker of the House of Representatives, has an introductory part about the retirement of famous sportsmen. In conclusion, our approach is not effective in classifying these articles.</Paragraph>
      <Paragraph position="4"> However, if we divide these articles into semantic objects, e.g. chapters and sections, these semantic objects may be classified by our approach. Table 3 shows the results of classifying the full text and each chapter of the book &quot;Artificial Intelligence and Human Being&quot;. Because this book is manually classified into the domain &quot;information science&quot; in the NDC, it is correct that the system classified this book into &quot;information science&quot;. It is also correct that the system classified Chapter 3 and Chapter 5 into &quot;linguistics&quot; and &quot;psychology&quot;, respectively, because human language is described in Chapter 3 and human psychological aspects are described in Chapter 5.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="798" end_page="798" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> The quality of the experimental results showed that our approach enables document classification with a good accuracy, and suggested the possibility for Japanese documents to be represented on the basis of the kanji characters they contain.</Paragraph>
  </Section>
</Paper>