Document Classification Using Domain Specific
Kanji Characters Extracted by χ² Method
Yasuhiko Watanabe†  Masaki Murata‡  Masahito Takeuchi‡  Makoto Nagao‡
† Dept. of Electronics and Informatics, Ryukoku University, Seta, Otsu, Shiga, Japan
‡ Dept. of Electronics and Communication, Kyoto University, Yoshida, Sakyo, Kyoto, Japan
watanabe@rins.ryukoku.ac.jp, {murata, takeuchi, nagao}@pine.kuee.kyoto-u.ac.jp
Abstract 
In this paper we describe a method of classifying
Japanese text documents using domain specific kanji
characters. Text documents are generally classified
by significant words (keywords) of the documents.
However, it is difficult to extract these significant
words from Japanese text, because Japanese texts
are written without blank spaces as delimiters
and must be segmented into words. Therefore,
instead of words, we used domain specific kanji
characters, which appear more frequently in one
domain than in the others. We extracted these domain
specific kanji characters by the χ² method. Then, using
these domain specific kanji characters, we classified
editorial columns "TENSEI JINGO", editorial
articles, and articles in "Scientific American (in
Japanese)". The correct recognition scores for them
were 47%, 74%, and 85%, respectively.
1 Introduction 
Document classification has been widely investigated
for assigning domains to documents for text retrieval,
or aiding human editors in assigning such domains.
Various successful systems have been developed to
classify text documents (Blosseville, 1992; Guthrie,
1994; Hamill, 1980; Masand, 1992; Young, 1985).
Conventional ways of developing document classification
systems can be divided into the following two
groups:
1. semantic approach 
2. statistical approach 
In the semantic approach, document classification is
based on words and keywords of a thesaurus. If the
thesaurus is constructed well, a high score is achieved.
But this approach has disadvantages in terms of
development and maintenance. On the other hand, in
the statistical approach, a human expert classifies
a sample set of documents into predefined domains,
and the computer learns from these samples how
to classify documents into these domains. This
approach offers advantages in terms of development
and maintenance, but the quality of the results is
not good enough in comparison with the semantic
approach. In either approach, document classification
using words has the following problems:
1. Words in the documents must be normalized 
for matching those in the dictionary and the 
thesaurus. Moreover, in the case of Japanese 
texts, it is difficult to extract words from them, 
because they are written without using blank 
spaces as delimiters and must be segmented 
into words. 
2. A simple word extraction technique generates
too many words. In the statistical approach,
the dimensions of the training space are too
big and the classification process usually fails.
Therefore, Japanese document classification on
words needs a high precision Japanese morphological
analyzer and a great amount of lexical knowledge.
Considering these disadvantages, we propose
a new method of document classification on kanji
characters, in which document classification is performed
without a morphological analyzer and lexical
knowledge. In our approach, we extracted domain
specific kanji characters for document classification
by the χ² method. The features of documents
and domains are represented using the feature
space the axes of which are these domain specific
kanji characters. Then, we classified Japanese
documents into domains by measuring the similarity
between new documents and the domains in the
feature space.
2 Document Classification on 
Domain Specific Kanji Characters 
2.1 Text Representation by Kanji
Characters
In previous research, texts were represented by
significant words, and a word was regarded as a minimum
semantic unit. But a word is not a minimum
semantic unit, because a word consists of one or
more morphemes. Here, we propose text representation
by morpheme. We have applied this idea
to Japanese text representation, where a kanji
character is a morpheme. Each kanji character has
its own meaning, and Japanese words (nouns, verbs,
adjectives, and so on) usually contain one or more
kanji characters which represent the meaning of the
words to some extent.
When representing the features of a text by kanji 
characters, it is important to consider which kanji 
characters are significant for the text representation 
and useful for classification. We assumed that these 
significant kanji characters appear more frequently 
in one domain than in the others, and extracted them
by the χ² method. From now on, these kanji characters
are called the domain specific kanji characters.

[Figure 1: A Procedure for the Document Classification Using Domain Specific Kanji Characters. The χ² method is applied to a sample set of Japanese texts to build the feature space for document classification; in the classification process, the similarity between an input document and each domain (e.g. philosophy, library science) is measured in the feature space.]

Then, we represented the content of a Japanese
text x as the following vector of domain specific
kanji characters:
\[ x = (f_1, f_2, \ldots, f_i, \ldots, f_I), \tag{1} \]
where component $f_i$ is the frequency of domain specific
kanji i and I is the number of all the kanji
characters extracted by the χ² method. In this way, the
Japanese text x is expressed as a point in the
I-dimensional feature space the axes of which are the
domain specific kanji characters. Then, we used this
feature space for representing the features of the domains.
Namely, the domain $v_i$ is represented using
the feature vector of domain specific kanji characters
as follows:
\[ v_i = (f_1, f_2, \ldots, f_i, \ldots, f_I). \tag{2} \]
We used this feature space not only for the text
representation but also for the document classification.
If the document classification is performed on
kanji characters, we may avoid the two problems
described in Section 1.
1. It is simpler to extract kanji characters than to
extract Japanese words.
2. There are about 2,000 kanji characters that
are considered necessary for general literacy.
So, the maximum number of dimensions of the
training space is about 2,000.
Of course, in our approach, the quality of the
results may not be as good as in the previous approaches
using words. But it is significant that
we can avoid the cost of morphological analysis, which
is itself not perfect.
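
As a rough illustration of the representation in (1), the following Python sketch counts how often each axis kanji occurs in a text. The function name `feature_vector` and the three axis characters are assumptions made for this example only; a real system would use the kanji characters extracted by the χ² method as axes.

```python
from collections import Counter

def feature_vector(text, axes):
    """Return the frequency of each axis kanji in `text`, in axis order."""
    counts = Counter(text)  # character counts; no word segmentation is needed
    return [counts[kanji] for kanji in axes]

# Hypothetical axes, chosen only for illustration.
axes = ["書", "館", "本"]
x = feature_vector("図書館の本と書庫の本", axes)
print(x)  # [2, 1, 2]
```

Because the text is treated as a plain sequence of characters, no morphological analyzer or dictionary is involved at this step.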
2.2 Procedure for the Document
Classification using Kanji Characters
Our approach is the following:
1. A sample set of Japanese texts is classified by
a human expert.
2. Kanji characters which are distributed unevenly among
text domains are extracted by the χ² method.
3. The feature vectors of the domains are obtained
from the domain specific kanji
characters and their frequencies of occurrence.
4. The classification system builds a feature vector
of a new document, compares it with the
feature vectors of each domain, and determines
the domain which the document belongs to.
Figure 1 shows the procedure for the document classification
using domain specific kanji characters.
3 Automatic Extraction of Domain
Specific Kanji Characters
3.1 The Learning Sample
For extracting domain specific kanji characters and
obtaining the feature vectors of each domain, we use
articles of "Encyclopedia Heibonsha" as the learning
sample. The reason why we use this encyclopedia
is that it is published in electronic form
and contains a great number of articles. This encyclopedia
was written by 6,727 authors, and contains
about 80,000 articles, 6.52 × 10^7 characters,
and 7.52 × 10^7 kanji characters. An example article
of "Encyclopedia Heibonsha" is shown in Figure
2. Unfortunately, the articles are not classified, but
there is the author's name at the end of each article
and the author's specialty is given in the preface. Therefore,
we can classify these articles into the authors'
specialties automatically.
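
The automatic labelling described above can be pictured as a simple lookup from author to specialty. The sketch below is only an illustration: the record layout, the article titles, and the function name `label_articles` are hypothetical, and a single author-specialty pair stands in for the table given in the preface.

```python
def label_articles(articles, specialty_of):
    """Label each article with its author's specialty; skip unknown authors."""
    labelled = []
    for article in articles:
        specialty = specialty_of.get(article["author"])
        if specialty is not None:
            labelled.append((article["title"], specialty))
    return labelled

# Author-to-specialty table as it might be read from the preface (one entry shown).
specialty_of = {"Yuriko Takeuchi": "Anglo American literature"}

# Hypothetical article records; the titles are placeholders.
articles = [
    {"title": "article A", "author": "Yuriko Takeuchi"},
    {"title": "article B", "author": "unknown author"},
]
labelled = label_articles(articles, specialty_of)
print(labelled)
```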
The specialties used in the encyclopedia are wide,
but they are not well balanced¹. Moreover, some
domains of the authors' specialties contain only a few
articles. So, it is difficult to extract appropriate
domain specific kanji characters from the articles
which are classified into the authors' specialties.

¹ For example, the specialty of Yuriko Takeuchi is
Anglo American literature, while that of
Koichi Amano is science fiction.

[Figure 2: An Example Article of "Encyclopedia Heibonsha", with its title, pronunciation, text, and author labeled.]

Therefore, it is important to consider that 206 
specialties in the encyclopedia, which represent al- 
most a half of the specialties, are used as the sub- 
jects of the domain in the Nippon Decimal Classifi- 
cation (NDC). For example, botany, which is one of 
the authors' specialties, is also one of the subjects 
of the domain in the NDC. In addition to this, 
the NDC has hierarchical domains. To keep the
domains well balanced, we combined the specialties
using the hierarchical relationship of the NDC. The 
procedure for combining the specialties is as follows: 
1. We aligned the specialties to the domains in the
NDC. 206 specialties corresponded to
domains of the NDC automatically, and the rest
were aligned manually.
2. We combined the 418 specialties into 59 code
domains of the NDC, using its hierarchical
relationship. Table 1 shows an example of the
hierarchical relationship of the NDC.
However, these 59 domains are still not well balanced. For
example, "physics", "electric engineering", and "German
literature" are all code domains of the NDC,
and we know intuitively that these domains are not well
balanced. So, to keep the domains well balanced,
we manually combined the 59 domains into 42.
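
One possible way to picture the combination step is to truncate each NDC number to its first two digits, which identify the code domain in Table 1. This is only a sketch under that assumption: the function names are hypothetical, the entries are the Table 1 examples rather than actual author specialties, and the final manual merge from 59 to 42 domains is not modelled.

```python
from collections import defaultdict

def ndc_code_domain(ndc_number):
    """Map an NDC number such as '548.23' to its two-digit code domain ('54')."""
    return ndc_number.split(".")[0][:2]

def combine_by_code(ndc_of):
    """Group labels by the code domain of their NDC number."""
    groups = defaultdict(list)
    for label, number in ndc_of.items():
        groups[ndc_code_domain(number)].append(label)
    return dict(groups)

# Entries taken from Table 1; all of them fall under code 54.
ndc_of = {
    "information engineering": "548",
    "computers": "548.2",
    "memory unit": "548.23",
}
groups = combine_by_code(ndc_of)
print(groups)
```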
3.2 Selection of Domain Specific Kanji
Characters by the χ² Method
Using the value χ² of the χ² test, we can detect
the unevenly distributed kanji characters and extract
these kanji characters as domain specific kanji
characters. Indeed, it was verified that the χ² method
is useful for extracting keywords instead of kanji
characters (Nagao, 1976).
Suppose we denote the frequency of kanji i in
the domain j by $x_{ij}$, and the frequency expected when kanji i is
distributed evenly by $m_{ij}$. Then the value χ² of kanji i, $\chi^2_i$,
is expressed by the equations as follows:
\[ \chi^2_i = \sum_{j=1}^{l} \chi^2_{ij} \tag{3} \]
\[ \chi^2_{ij} = \frac{(x_{ij} - m_{ij})^2}{m_{ij}} \tag{4} \]
\[ m_{ij} = \frac{\sum_{j=1}^{l} x_{ij}}{\sum_{i=1}^{k} \sum_{j=1}^{l} x_{ij}} \times \sum_{i=1}^{k} x_{ij} \tag{5} \]
where k is the number of varieties of the kanji characters
and l is the number of the domains. If the
value $\chi^2_i$ is relatively big, we consider that the kanji
i is distributed unevenly.
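
A minimal sketch of this computation follows, assuming the standard χ² statistic in which the expected frequency of kanji i in domain j is the total frequency of kanji i times the total frequency of domain j, divided by the grand total. The function name `chi_square` and the toy frequency table are illustrative assumptions, not values from the experiments.

```python
def chi_square(x):
    """x[i][j] is the frequency of kanji i in domain j.

    Returns (chi2_ij, chi2_i): the per-cell values and their sum per kanji.
    """
    k, l = len(x), len(x[0])
    row = [sum(x[i]) for i in range(k)]                        # total per kanji
    col = [sum(x[i][j] for i in range(k)) for j in range(l)]   # total per domain
    total = sum(row)
    chi2_ij = []
    for i in range(k):
        cells = []
        for j in range(l):
            m = row[i] * col[j] / total                        # expected frequency
            cells.append((x[i][j] - m) ** 2 / m if m > 0 else 0.0)
        chi2_ij.append(cells)
    chi2_i = [sum(cells) for cells in chi2_ij]
    return chi2_ij, chi2_i

# Toy table: kanji 0 occurs only in domain 0, kanji 1 mostly in domain 1.
x = [[10, 0],
     [10, 20]]
chi2_ij, chi2_i = chi_square(x)
print(chi2_i)
```

In this toy table the first kanji, which is concentrated in one domain, receives the larger value, which matches the intended reading of an unevenly distributed character.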
There are two considerations about the extraction
of the domain specific kanji characters using
the χ² method. The first is the size of the training
samples. If the sizes of the training samples differ,
the ranking of domain specific kanji characters
is not equal to the ranking of the value χ². The second
is that we cannot recognize which domains are
represented by the extracted kanji characters using
only the value $\chi^2_i$ of equation (3). In other words,
there is no guarantee that we can extract the appropriate
domain specific kanji characters from every
domain. For this reason, we extracted a fixed
number of domain specific kanji characters from every
domain using the ranking of the value $\chi^2_{ij}$ of
equation (4) instead of (3). Not only the value $\chi^2_i$
of equation (3) but also the value $\chi^2_{ij}$ of equation (4)
becomes big when the kanji i appears more frequently
in the domain j than in the others. Table 2 shows
the top 20 domain specific kanji characters of the 42
domains. Further, the Appendix shows the meanings
of each domain specific kanji character of the "library
science" domain.
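
The per-domain selection can be sketched as follows: for each domain, rank the kanji by their per-cell χ² value and keep a fixed number of them. The function names, the three example characters, and the numeric values are hypothetical, and taking the union of the per-domain selections as the axes of the feature space is an assumption about how the selections are combined.

```python
def top_kanji_per_domain(kanji, chi2_ij, n):
    """For each domain j, return the n kanji with the largest chi2_ij[i][j]."""
    selected = []
    for j in range(len(chi2_ij[0])):
        ranked = sorted(range(len(kanji)),
                        key=lambda i: chi2_ij[i][j], reverse=True)
        selected.append([kanji[i] for i in ranked[:n]])
    return selected

def feature_axes(selected):
    """Union of the per-domain selections, keeping first-seen order."""
    axes = []
    for chars in selected:
        for c in chars:
            if c not in axes:
                axes.append(c)
    return axes

# Hypothetical per-cell values for three kanji (rows) and two domains (columns).
kanji = ["書", "法", "数"]
chi2_ij = [[9.0, 1.0],
           [0.5, 8.0],
           [2.0, 3.0]]
selected = top_kanji_per_domain(kanji, chi2_ij, n=2)
axes = feature_axes(selected)
print(selected, axes)
```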
3.3 Feature Space for the Document
Classification
In order to measure the closeness between an unclassified
document and the 42 domains, we proposed
a feature space the axes of which are the domain
specific kanji characters extracted from the 42 domains.
To represent the features of an unclassified
document and the 42 domains, we used feature vectors
(1) and (2), respectively. To find the closest
domain, we measured the angle between the unclassified
document and each of the 42 domains in the feature
space. If we are given a new document the feature
vector of which is x, the classification system can
compute the angle θ with each vector $v_i$ which represents
the domain i and find the $v_i$ with
\[ \min_i \theta(v_i, x). \]
Using this procedure, every document is classified
into the closest domain.
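
The minimum-angle rule can be sketched in a few lines of Python. The domain names are taken from the paper, but the three-dimensional vectors are made-up values, the function names are hypothetical, and treating a zero vector as orthogonal to everything is an assumption added so that the example never divides by zero.

```python
import math

def angle(u, v):
    """Angle in radians between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return math.pi / 2  # assumption: a zero vector is treated as orthogonal
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.acos(cosine)

def classify(x, domain_vectors):
    """Return the name of the domain whose vector makes the smallest angle with x."""
    return min(domain_vectors, key=lambda name: angle(domain_vectors[name], x))

# Made-up feature vectors over three hypothetical axes.
domain_vectors = {
    "library science": [5, 1, 0],
    "law": [0, 1, 6],
}
result = classify([2, 0, 1], domain_vectors)
print(result)  # library science
```

Since only the angle is used, a long document and a short document with the same proportions of domain specific kanji are classified identically.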
Table 1: Division of the Nippon Decimal Classification
  5(00)    technology/engineering    class
  54(0)    electrical engineering    code
  548      information engineering   item
  548.2    computers                 detailed item       (small item)
  548.23   memory unit               more detailed item  (small item)
NDC is the most popular library classification in Japan and it has hierarchical domains. NDC
has 10 classes. Each class is further divided into 10 codes. Each code is divided into 10 items,
which in turn have details using one or two digits. Each domain is assigned a decimal code.
Table 2: Top 20 Domain Specific Kanji Characters of the 42 Domains
[For each domain, the table lists its domain specific kanji characters from the biggest to the smallest value $\chi^2_{ij}$ of equation (4). The 42 domains are:]
library science
philosophy
psychology
science of religion
sociology
politics
economics
law
military science
pedagogy
commerce
folklore
scientific history
mathematics
information science
astronomy
physics
chemistry
earth science
archeology
biology
botany
zoology
medical science
engineering
agriculture
management
chemical industry
machinery
architecture
art
environment
printing
music/dance
amusement
linguistics
Western literature
Eastern literature
geography
ancient history
Western history
Eastern history
4 Document Classification Using
Domain Specific Kanji Characters
4.1 Experimental Results
For evaluating our approach, we used the following
three sets of articles in our experiments:
1. articles in "Scientific American (in Japanese)"
(162 articles)
2. editorial columns in Asahi Newspaper "TENSEI
JINGO" (about 2,000 articles)
3. editorial articles in Asahi Newspaper (about
3,000 articles)
Because the articles in "Scientific American (in Japanese)"
are not classified, we classified them manually.
The articles of "TENSEI JINGO" and the
editorial articles are classified by editors into a hierarchy
of domains which differ from the domains
of the NDC. We aligned these domains to the 42
domains described in Section 3.1. Some of these
articles contain two or more themes, and these articles
are classified into two or more domains by editors.
For example, the editorial article "Too Many
Katakana Words" is classified into three domains.
In these cases, we judge that the result of the automatic
classification is correct when it corresponds
to one of the domains into which the document is classified
by editors. Figure 3, Figure 4, and Figure 5
describe the variations of the classification results
with respect to the number of domain specific kanji
characters.
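
The correctness criterion for articles with several editor-assigned domains can be sketched as below: a prediction counts as correct when it matches any one of the editors' domains. The function name `recognition_score` and the three example articles are hypothetical and are not taken from the experimental data.

```python
def recognition_score(predictions, editor_domains):
    """Fraction of documents whose predicted domain is among the editors' domains."""
    correct = sum(1 for predicted, gold in zip(predictions, editor_domains)
                  if predicted in gold)
    return correct / len(predictions)

# Hypothetical system outputs and editor labels for three articles.
predictions = ["linguistics", "politics", "economics"]
editor_domains = [
    {"linguistics", "sociology", "pedagogy"},  # an article with three domains
    {"law"},
    {"economics"},
]
score = recognition_score(predictions, editor_domains)
print(round(score, 2))  # 0.67
```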
4.2 Evaluation 
In our approach, the maximum correct recognition
scores for the editorial articles and the articles in
"Scientific American (in Japanese)" are 74% and 85%,
respectively. Considering that our system uses
only the statistical information of kanji characters
and deals with a great amount of documents which
cover various specialties, our approach achieved a
good result in document classification. From this,
we believe that our approach is effective for broadly
classifying various subjects of documents, e.g.
news stories. A method for classifying news stories
is significant for distributing and retrieving articles
in electronic newspapers.
The maximum correct recognition score for "TENSEI
JINGO" is 47%. The reasons why the result is far
worse than the other results are:
1. The style of the documents
The style of "TENSEI JINGO" is similar to
that of an essay or a novel and it is written in
colloquial Japanese. In contrast, the style of
the editorial articles and "Scientific American
(in Japanese)" is similar to that of a thesis. We
think the reason why we achieved the good result
in the classification of the editorial articles
and "Scientific American (in Japanese)" is that
many technical terms are used in them and it is
likely that the kanji characters which represent
the technical terms are domain specific kanji
characters in that domain.
2. Two or more themes in one document
Many articles of "TENSEI JINGO" contain two
or more themes. In these articles, it is usual
that the introductory part has little relation
to the main theme. For example, the article
"Splendid Retirement", whose main theme is
the resignation of the Speaker of the House of
Representatives, has an introductory part about
the retirement of famous sportsmen. Consequently,
our approach is not effective in classifying
these articles.
However, if we divide these articles into semantic
objects, e.g. chapter and section, these
semantic objects may be classified by our approach.
Table 3 shows the results of classifying the
full text and each chapter of a book "Artificial
Intelligence and Human Being". Because
this book is manually classified into the domain
"information science" in the NDC, it is correct
that the system classified this book into
"information science". And it is correct that
the system classified Chapter 3 and Chapter 5
into "linguistics" and "psychology", respectively,
because human language is described in
Chapter 3 and the human psychological aspect is
described in Chapter 5.

Table 3: A Classification Result of a book "Artificial Intelligence and Human Being"
  Chapter     Title                                    Result
  Chapter 1   The Ability of Computers                 information science
  Chapter 2   Challenge to Human Recognition           information science
  Chapter 3   Aspects of Natural Language              linguistics
  Chapter 4   What is the Understanding?               information science
  Chapter 5   Artificial Intelligence and Philosophy   psychology
  Full Text of "Artificial Intelligence and Human Being"   information science

[Figure 3: Variations of the Classification Results for "TENSEI JINGO" by the Number of Domain Specific Kanji Characters in Individual Domains. Horizontal axis: the number of domain specific kanji characters in individual domains (up to 100); vertical axis range: 0.3 to 0.6.]

[Figure 4: Variations of the Classification Results for the editorial articles by the Number of Domain Specific Kanji Characters in Individual Domains. Horizontal axis: the number of domain specific kanji characters in individual domains (10 to 100); vertical axis range: 0.55 to 0.85.]

[Figure 5: Variations of the Classification Results for "Scientific American (in Japanese)" by the Number of Domain Specific Kanji Characters in Individual Domains. Horizontal axis: the number of domain specific kanji characters in individual domains (10 to 100); vertical axis range: 0.6 to 0.9.]

5 Conclusion 
The quality of the experimental results showed that
our approach enables document classification with
good accuracy, and suggested the possibility for
Japanese documents to be represented on the basis
of the kanji characters they contain.
6 Future Work 
Because the training samples were created without
this application in mind, we may be able to improve
the performance by increasing the size of the
training samples or by using different samples which
have similar styles and contents to the documents.
We would also like to study the relation
between the quality of the classification result and
the size of the documents.
References 
Blosseville M.J., Hébrail G., Monteil M.G., Pénot N.:
"Automatic Document Classification: Natural Language
Processing, Statistical Analysis, and Expert
System Techniques used together", SIGIR '92, pp.
51-58, 1992.
Guthrie L., Walker E., Guthrie J.: "DOCUMENT CLASSIFICATION
BY MACHINE: Theory and Practice",
COLING 94, pp. 1059-1063, 1994.
Hamill K.A., Zamora A.: "The Use of Titles for Automatic
Document Classification", Journal of the American
Society for Information Science, pp. 396-402,
1980.
Masand B., Linoff G., Waltz D.: "Classifying News Stories
using Memory Based Reasoning", SIGIR '92, pp.
59-65, 1992.
Nagao M., Mizutani M., Ikeda H.: "An Automatic Method
of the Extraction of Important Words from Japanese
Scientific Documents" (in Japanese), Transactions of
IPSJ, Vol.17 No.2, pp.110-117, 1976.
Young S.R., Hayes P.J.: "Automatic Classification and
Summarization of Banking Telexes", Proceedings of
the Second IEEE Conference on AI Applications, pp.
402-408, 1985.
Appendix 
The meanings of each domain specific kanji character of
the "library science" category are as follows:
書 write; draw; writing, art of writing, calligraphy, penmanship;
books, literary work; letter, note
版 printing block, printing plate, wood block; publishing,
printing; printing, edition, impression
館 building, hall, mansion, manor; suffix of public building
(esp. a large building for cultural activities),
hall, edifice, pavilion
冊 counter for books, volumes or copies; bound book,
volume, copy
庫 storehouse, warehouse, storage chamber
本 basis, base, foundation; origin, source, root, beginning;
book, volume, work, magazine; this, the same,
the present; head, main, principal; real, true, genuine;
counter for cylindrical objects (bottles, pencils,
etc.)
紙 paper; newspaper, periodical, publication
丁 town subsection, city block-size area; counter for
dishes of food, blocks of tofu, guns; two-page leaf
of paper
図 drawing, diagram, plan, figure, illustration, picture;
map, chart; systematic plan, scheme, attempt, intention
糊 paste, glue; starch, sizing
刊 publish; publication, edition, issue
刷 print, put in print; counter for printings
印 (visual sign) seal, stamp, seal impression; sign, mark,
symbol, imprint; print; India
巻 volume, book; roll, reel; roll up, roll, scroll, wind,
coil
帖 notebook, book, register; counter for quires (of paper),
folding screens, volumes of Japanese books,
etc.; counter for tatami mats
誌 magazine, periodical, suffix names of magazines or
periodicals; write down, chronicle
文 letter, character, script, inscription; writing, composition,
sentence, text, document, style; letters, literature,
the pen; culture, learning, the arts design;
letter, note
蔵 store, put away, lay by; own, possess, keep (a collection
of books); storehouse, storing place, treasury
折 break, be folded, bent; turn (left/right); yield, compromise
獄 prison, jail; hell; lawsuit, litigation
