Automatic Text Categorization by Unsupervised Learning 
Youngjoong Ko 
Department of Colnputer Science, 
Sogang University 
1 Sinsu-dong, Mapo-gu 
Seoul, 121-742, Korea 
kyj @nlpzodiac.sogang.ac.kr, 
Jungyun Seo 
Depmlment of Computer Science, 
Sogang University 
1 Sinsu-dong, Mapo-gu 
Seoul, 121-742, Korea 
seoiy @ccs.sogang.ac.kr 
Abstract 
The goal of text categorization is to classify 
documents into a certain number of pre- 
defined categories. The previous works in 
this area have used a large number of 
labeled training doculnents for supervised 
learning. One problem is that it is difficult to 
create the labeled training documents. While 
it is easy to collect the unlabeled documents, 
it is not so easy to manually categorize them 
for creating traiuing documents. In this 
paper, we propose an unsupervised learning 
method to overcome these difficulties. The 
proposed lnethod divides the documents into 
sentences, and categorizes each sentence 
using keyword lists of each category and 
sentence simihuity measure. And then, it 
uses the categorized sentences for refining. 
The proposed method shows a similar 
degree of performance, compared with the 
traditional supervised learning inethods. 
Therefore, this method can be used in areas 
where low-cost text categorization is needed. 
It also can be used for creating training 
documents. 
Introduction 
With the rapid growth of the internet, the 
availability of on-line text information has been 
considerably increased. As a result, text 
categorization has become one of the key 
techniques fox handling and organizing text data. 
Automatic text categorization in the previous 
works is a supervised learning task, defined as 
assigning category labels (pro-defined) to text 
documents based on the likelihood suggested by 
a training set of labeled doculnents. However, 
the previous learning algorithms have some 
problems. One of them is that they require a 
large, often prohibitive, number of labeled 
training documents for the accurate learning. 
Since the application area of automatic text 
categorization has diversified froln newswire 
articles and web pages to electronic mails and 
newsgroup postings, it is a difficult task to 
create training data for each application area 
(Nigam K. et al., 1998). 
In this paper, we propose a new automatic text 
categorization lnethod based on unsupervised 
learning. Without creating training documents 
by hand, it automatically creates training 
sentence sets using keyword lists of each 
category. And then, it uses them for training and 
classifies text documents. The proposed method 
can provide basic data fox" creating training 
doculnents from collected documents, and can 
be used in an application area to classify text 
documents in low cost. We use the 2 / statistic 
(Yang Y. et al., 1998) as a feature selection 
method and the naive Bayes classifier 
(McCailum A. et al., 1998) as a statistical text 
classifier. The naive Bayes classifier is one of 
the statistical text classifiers that use word 
frequencies as features. Other examples include 
k-nearest-neighbor (Yang Y. et al., 1994), 
TFIDF/Roccio (Lewis D.D. et al., 1996), 
support vector machines (Joachilns T. et al., 
1998) and decision tree (Lewis D.D. et al., 
1994). 
1 Proposal: A text categorization scheme 
The proposed system consists of three modules 
as shown in Figure 1; a module to preprocess 
collected documents, a module to create training 
sentence sets, and a module to extract features 
and to classify text doculnents. 
453 
............. 1\] 
i 
J 
Text (~ 
/ 
/ " 
,L',.~g,,r} i, I ) 
.1 
.l&ll ego 13', 
Ax~J~llJll~ ', /, 
I( ",'11 ello I'1\] 
Figurel : Architecture for the proposed system 
1.1 Preprocessing 
First, the html tags and special characters in the 
collected documents are removed. And then, the 
contents of the documents are segmented into 
sentences. We extract content words for each 
sentence using only nouns. In Korean, there are 
active-predicative common nouns which become 
verbs when they am combined with verb- 
derivational suffixes (e.g., ha-ta 'do', toy-la 
'become', etc.). There are also stative- 
predicative common nouns which become 
adjectives when they are combined with 
adjective-derivational suffixes such as ha. These 
derived verbs and adjectives are productive in 
Korean, and they are classified as nouns 
according to the Korean POS tagger. Other 
verbs and adjectives are not informative in many 
cases. 
1.2 Creating training sentence sets 
Because the proposed system does not have 
training documents, training sentence sets for 
each category corresponding to the training 
documents have to be created. We define 
keywords for each category by hand, which 
contain special features of each category 
sufficiently. To choose these keywords, we first 
regard category names and their synonyms as 
keywords. And we include several words that 
have a definite meaning of each category. The 
average number of keywords for each category 
is 3. (Total 141 keywords for 47 categories) 
Table 1 lists the examples of keywords for 
each category. 
Table 1: Examples of keywords for each category 
Category Keywords 
ye-hayng (trip), 
kwan-kwang 
(sightseeing) 
Um-ak(music) 
Cong-kyo 
(religion) 
Pang-song 
(broadcasting) 
ye-hayng (trip), 
kwan-kwang (sightseeing) 
Um-ak (music) 
Cong-kyo (religion), 
chen-cwu-kyo(Catholicism) 
ki-tok-kyo(Christianity), 
pwul-kyo(Buddhism) 
Pang-song (broadcasting), TV, thal- 
ley-pi-cyen(television), la-ti-o(radio) 
Next, the sentences which contain pre-defined 
keywords of each category in their content 
words are chosen as the initial representative 
sentences. The remaining sentences am called 
unclassified sentences. We scale up the 
representative sentence sets by assigning the 
unclassified sentences to their related category. 
This assignment has been done through 
measuring similarities of the unclassified 
sentences to the representative sentences. We 
will elaborate this process in the next two 
subsections. 
1.2.1 Extracting and verifying representative 
sentences 
We define the representative sentence as what 
contains pre-defined keywords of the category in 
its content words. But there exist error sentences 
in the representative sentences. They do not 
have special features of a category even though 
they contain the keywords of the category. To 
relnove such error sentences, we can rank the 
representative sentences by computing the 
weight of each sentence as follows: 
1) Word weights are computed using Term 
Frequency (TF) and Inverse Category Frequency 
(ICF) (Cho K. et al., 1997). 
@ The within-category word frequency(TF~j), 
TFij = the number of times words ti occurs 
in the j th category (1) 
® In Information Retrival, Inverse Document 
Frequency (IDF) are used generally. But a 
sentence is a processing unit in the 
proposed method. Therefore, the document 
frequency cannot be counted. Also, since 
ICF was defined by Cho K. et al. (1997) 
454 
and its efficiency was verified, we use it in 
tile proposed method. ICF is computed as 
follows: 
ICF i -- Iog(M ) - Iog(CF i ) (2) 
® 
where CF is tile number of categories that 
contain t;, and M is tile total number of 
categories. 
Tile Colnbination (TFICF) of the above (9 
and ®, i.e., weight w~ i of word t; in ./tit 
category is computed as follows: 
wij --- TFii x ,cl~,. 
: Tl'ij X (log(M) - log(CF i ) ) (3) 
2) Using word weights (%) computed in 1), a 
sentence weight (We) in jth category are 
computed as follows: 
W Ij q- W2j +...-F WNj W!/ = (4) 
N 
where N is the total number of words in a 
sentence. 
3) The representative sentences of each category 
are sorted in the decreasing order of weight, 
which was computed in 2). And then, the lop 
70% of tile representative sentences are selected 
and used in our experiment. It is decided 
empirically. 
1.2.2 Extending representative sentence sets 
To extend lhe representative sentence sets, the 
unclassified sentences are classified into their 
related category through measuring similarities 
of the unclassified sentences to the 
representative sentences. 
(l) Measurement of word and sentence 
similarities 
As similar words tend to appear in similar 
contexts, we compute the similarity by using 
contextual information (Kim H. et al., 1999; 
Karov Y. et al., 1999). In this paper, words and 
sentences play COlnplementary roles. That is, a 
sentence is represented by the set of words it 
contains, and a word by the set of sentences in 
which it appears. Sentences are simihu" to the 
extent that they contain similar words, and 
words are similar to the extent that they appear 
in similar sentences. This definition is circular. 
Titus, it is applied iteratively using two matrices 
as shown in Figure 2. in this paper, we set the 
number of iterations as 3, as is recommended by 
Karov Y. et al. (1999). 
Sirnilar~~ Word " Sente,,C; : 
i A i d 
Figure 2: llerative computation of word and 
sentence similarities 
In Figure 2, each category has a word 
silnilarity matrix WSM,, and a sentence similarity 
matrix SSM,,. In each iteration n, we update 
WSM,, whose rows and columns are labeled by 
all content words encountered in the 
rcpresentatwe sentences of each category and 
input unclassified sentences. In that lnatrix, the 
cell (i j) hokls a value between 0 and l, 
indicating the extent to which the ith word is 
contextually similar to the jth word. Also, we 
keep and update a SSM,,, which holds similarities 
among sentences. The rows of SSM,, correspond 
to the unclassified sentences and the cohmms to 
the representative sentences. In this paper, the 
number of input sentences of row and column in 
SSM is limited to 200, considering execution 
time and memory allocation. 
To compute tile similarities, we initialize 
WSM, to the identity matrix. That is, each word 
is fully similar (1) to itself and completely 
dissimilar (0) to other words. The following 
steps are iterated until the changes in the 
similarity values are small enough. 
1. Update the sentence similarity lnatrix SSM,,, 
using the word similarity matrix WSM,. 
2. Update the word similarity matrix WSM,,, 
using the sentence similarity matrix SSM,. 
(2) Affinity formulae 
qb simplify tile symmetric iterative treatment of 
similarity between words and sentences, we 
del'ine an auxiliary relation between words and 
sentences as affinity. A woM W is assumed to 
have a certain affinity to every sentence, which 
455 
is a real number between 0 and 1. It reflects the 
contextual relationships between W and the 
words of the sentence. If W belongs to a 
sentence S, its affinity to S is 1. If W is totally 
unrelated to S, the affinity is close to 0. If W is 
contextually similar to the words of S, its affinity 
to S is between 0 and 1. In a similar manner, a 
sentence S has some affinity to every word, 
reflecting the similarity of S to the sentences 
involving that word. 
Affinity formulae are defined as follows 
(Karov Y. et al., 1999). In these formulae, W ~ S 
means that a word belongs to a sentence: 
aft,, (W, S) = max w, es sire,, (W , W i ) 
aff,, (S, W) = max w~s; sire,, (S, S~ ) 
(5) 
(6) 
In the above formulae, n denotes the iteration 
number, and the similarity values are defined by 
WSM,, and SSM,,. Every word has some affinity 
to the sentence, and the sentence can be 
represented by a vector indicating the affinity of 
each word to it. 
(3) Similarity formulae 
The similarity of Wj to W2 is the average affinity 
of the sentences that include W~ to 1+'2, and the 
similarity of a sentence S~ to $2 is a weighted 
average of the affinity of the words in S~ to Se. 
Similarity formulae are defined as follows 
(Karov Y. et al., 1999): 
sim,,+l (Sj, S 2 ) = Z weight(W, S 1 ). qlJ',, (W, S 2 ) (7) 
WE ,~'~ 
if W I =W 2 
sim,,+l (W l , W 2 ) = 1 
C/,?e 
sim"+l (Wl' W2 ) = Z weight(S, W l ). aft,, (S, W 2 ) (8) 
W~eS 
The weights in Formula 7 are computed 
following the methodology in the next section. 
The sum of weights in Formula 8, which is a 
reciprocal number of sentences that contain W, 
is !. These values are used to update the 
corresponding entries of WSM and SSM,,. 
(4) Word weights 
In Formula 7, the weight of a word is a product 
of three factors. It excludes the words that are 
expected to be given unreliable similarity values. 
The weights are not changed in their process of 
iterations. 
l. Global frequency: Frequent words in total 
sentences are less informative of sense and of 
sentence similarity. For example, a word like 
'phil-yo(necessity)' frequently appears in any 
sentence. The formula is as follows (Karov Y. 
et al., 1999): 
max{0,1 freq(W) "1 
max 5, freq(x) J (9) 
In (9), max52\[req(x) is the sum of the five 
highest frequencies in total sentences. 
2.Log-likelihood faclor: In general, the words 
that are indicative of the sense appear in 
representative sentences more frequently than 
in total sentences. The log-likelihood factor 
captures this tendency. It is computed as 
follows (Karov Y. et al., 1999): 
log pr(w; l w) (lO) 
Pr(Wi ) 
In (10), Pr(Wi) is estimated from the 
frequency of Wi in the total sentences, and 
Pr(WilW) fi'om the frequency of Wi in 
representative sentences. To avoid poor 
estimation for words with a low count in 
representative sentences, we nmltiply the log- 
likelihood by (11) where count(Wi) is the 
number of occurrences of Wi in representative 
sentences. For the words which do not appear 
in representative sentences, we assign weight 
(1.0) to them. And the other words are 
assigned weight that adds 1.0 to computed 
value: 
c°unt(Wi) t (11) min. 1, 3 
3.Part of ,q~eech: Each part of speech is 
assigned a weight. We assign weight (1.0) to 
proper noun, non-predicative common noun, 
and foreign word, and assign weight (0.6) to 
active-predicative common noun and stative- 
predicative common noun. 
456 
The total weight of a word is the product of the 
above t'actors, each norlnalized by the sum of 
factors of the words in a sentence as follows 
(Karov Y. et al., 1999): 
,&ctor(Wi, S) weight 
£ J'actor(Wi, S) 
IVieS 
(12) 
In (12), factor(W, S) is the weight before 
normalization. 
(5) Assigning unclassified sentences to a 
category 
We first computed similarities of the 
unclassified sentences to the representative 
sentences. And then, we decided a Silnilarity 
value o1' each unclassified sentence for each 
category using two alternate ways. 
1 tl sint(X,ci)=-- £ 
Sil!l (X,Sj) (13) tiE(' I1 . ( ,'~'jcR,., 
j= J 
sinl(X,ci)=nlnxl.siul(X Si)} (14) (:'~(" I SjcRc, 
In (13) and (14), i) X is au unclassified sentence, 
ii) C = {c l,c2 ..... c,,,} is a category set, and iii) 
R,,,={&,Sa ...... S',,} is a representative sentence 
set of category c.. 
Each unclassified sentence is assigned to a 
category which has a maxinmln similarity wflue. 
But there exist unclassified sentences which do 
not belong to any category. To remove these 
unclassified sentences, we set up a threshold 
value using normal distribution of similarity 
values as follows: 
max{sim(X,ci) } >_ tt + 017 (15) 
ciEC 
In (15), i) X is an unclassified sentence, ii) It is 
an average of similarity wflues, iii) o is a 
standard deviation of similarity wdues, and iv) 0 
is a numerical wdue corresponding to 
threshold(%) in normal distribution table. 
1.3 Feature selection and text classifier 
1.3.1 Feature Selection 
The size of the vocabulary used in our 
experiment is selected by ranking words 
according to their Z 2 statislic with respect to the 
category. Using the two-way contingency table 
of a word t and a category c - i) A is the number 
of times t and c co-occur, ii) B is the number of 
times t occurs without c, iii) C is the number of 
times c occurs without t, iv) D is the number ot' 
times ueither c nor t occurs, and vi) N is the total 
number of sentences - the word-goodness 
measure is defined as follows (Yang Y. et al., 
1997): 
Z2(t,c) = N×(AD-CB)2 (16) 
(A + C)(B + D)(A + B)(C + D) 
To measure the goodness of a word in a 
global feature selection, we combine the 
category-specific scores of a word as follows: 
III 9 2 
ZF,,.,~, (t) = n~a,x{ Z.= (t, (:~)} (17) 
1.3.2 Text classifier 
The method that we use for classifying 
documents is uaivc Bayes, with minor 
modifications based on Kullback-Leibler 
Divergence (Craven M. et al., 1999). The basic 
idea in naive Bayes approaches is to use the joint 
probabilities of words and categories to estimate 
the probabilities of categories given a document. 
Given a document d for chtssit'ication, we 
calculate the probabilities of each category c as 
follows: 
Pr(cld) Pr(c) Pr(d lc) 7' _ _ Pr(c)H Pr(t i Ic.) N(rM~ 
Pr(d) i< 
T !c)) 
,, ;=, Id) ) 
In tile above l'ormula, i) 11, is the number of 
words in d, ii) N(t~ld) is the frequency of woM t i 
in clocument d, iii) 7" is the size of tile 
vocabulary, and iv) t~ is tile ith word in the 
vocabulary. Pr(tAc ) thus represents the 
probability that a randomly drawn woM from a 
randolnly drawn docmnent in category c will be 
the word 6 Pr(tild) represents the proportion of 
woMs in docmnent d that are word t c Each 
probability is estimated by formulae (19) and 
(20), which are called the expected likelihood 
457 
estimator (Li H. et al., 1997). The category 
predicted by the method for a given document is 
simply the category with the greatest score. This 
method performs exactly the same 
classifications as naive Bayes does, but produces 
classification scores that are less extreme. 
N(ti, c) + 0.5 Pr(ti \[ c) = T( (l 9) 
Z N(t i' c) + 0.5 x T c 
j=l 
Pr(tiid) = N(t.i,d)+O.5xT, z (20) 
0 if N(ti,d): 0 
2 Evaluation of experiment 
2.1 Performance measures 
In this paper, a document is assigned to only one 
category. We use the standard definition of 
recall, precision, and F, measure as performance 
measures. For evaluating performance average 
across categories, we use the micro-averaging 
method. F~ measure is defined by the following 
formula (Yang Y. et al., 1997): 
2 q) 
F 1 (r, p) - (21 ) r+ p 
where r represents recall and p precision. It 
balances recall and precision in a way that gives 
them equal weight. 
2.2 Experiment settings 
We used total 47 categories in our experiment. 
They consist of 2,286 documents to be collected 
in web. We did not use tag information of web 
documents. And a so-called bag of words or 
unigraln representation was used. Table 2 shows 
the settings of experiment data in detail. 
Table 2: Setting experiment data 
........................... i i avg# avg # of I #of' #of 
I .... of doc, sen. ..... d0c sen: 
...... ~ .......... ~ inacat, inadoc. 
Training 1,383 67,506 29.4 48.8 Set (60%) 
903 Test Set 56,446 19.2 62.5 
(40%) 
2.3 Prinmry results 
2.3.1 Results of the different combinations of 
similarity value decisions and thresholds 
We evaluated our method according to the 
different combinations of similarity value 
decisions and thresholds in section 1.2.2. We 
used thresholds of top 5%, top 10%, top 15%, 
top20% in formula (15), and tested the two 
options, average and maximum in formulae (13) 
and (14). We limited our vocabulary to 2,000 
words in this experiment. 
+Close Test(max) -~N-~Close Test(avg) 
+Open Test(max) ---')(~Open Test(avo) 
0.73 .................................................................... 
0.72 
UL 0.71 
0.7 
0.69 
E 0.68 
0.67 
0.66 
5% 10% 15% 20% 
Threshold(%) 
Figure 3: Results of the different combinations of 
similarity wdue decisions and thresholds 
Figure 3 shows results according to the two 
options in each threshold. Here, the result using 
maxinmm was better than that using average 
with regrad to all thresholds. The results of top 
10% and top 15% were best. Therefore, we used 
the maximum in the decision of similarity value 
and top 15% in threshold in our experiments. 
2.3.2 The proposed system vs. the system by 
supervised learning 
For the fair evaluation, we embodied a 
traditional system by supervised learning using 
the same feature selection method (2/ statistic) 
and classifier (naive Bayes Classifier), as used in 
the proposed system. And we tested these 
systems and compared their performance: 
458 
+method by supervised learning '-~tr-proposed method 
0.8 ..................................................................................................... 
0.770 
0.75 
0.725 
E. o7 
>~ 0.675 
0.05 
E 0.0 
0675 
0.!i5 
0,525 
6.5 .................................................................................................... 
Vooabulary Size 
lqgure 4: Comparison of the proposed system and the 
syslem by supervised learning 
Figure 4 displays the performance curves for the 
proposed system and the system by supervised 
learning. The best F~ score of the proposed 
system is 71.8% and that of the system by 
supervised learning is 75.6%. Therefore, the 
difference between them is only 3.8%. 
Conclusion 
This paper has described a new automatic text 
categorization method. This method 
automatically created training sets using 
keyword lists of each category and used them 
for training. And then, it classified text 
documents. This could be a significant method 
in text learning because of the high cost of hand- 
labeling training docmnents and the awfilability 
of huge volumes of unlabeled documents. The 
experiment results showed that with respect to 
performance, the difference between the 
proposed method and the method by supervised 
learning is insignificant. Therefore, this method 
can be used in areas where low-cost text 
categorization is required, and can be used for 
creating training data. 
This study awaits further research. First, a 
more scientific approach for defining keyword 
lists should be investigated. Next, if we use a 
word sense disambiguation systeln in the 
extraction step of representative sentences, we 
would be able to achieve a better performance. 
Acknowledgments 
This work was supported by KOSEF under 
Grant No. 97-0102-03-01-3. We wish to thank 
Jeoung-seok Kiln for his valuable COlnments to 
the earlier version of this paper. 

References 

Cho K. and Kim J. (1997) Automatic Text 
Categorization on Hierarchical Category Structure 
by using ICF(Inverted Category Frequency) 
Weighting. In Proceedings of KISS coqference, 
pp.507-510. 

Craven M., DiPasquo D., Freitag I)., McCallum A., 
Mitchell T., Nigam K. and Slauery S. (1999) 
l,earning to Conslruct Knowledge Bases from lhe 
World Wide Web. to appear in Artificial 
httelligence. 

Joachims T. (1998)Text Categorization with Supporl 
Vector Machines: Learning with Many Relevant 
Features. In European Conference on Machine 
Learning(ECML). 

Karov Y. and 17,dehnan S. (1998) Similarity-based 
Word Sense l)isambiguation. Computational 
Linguistics, Vol 24, No I, pp. 41-60. 

Kim H., KEY., Park S. and See J. (1999) hfformal 
Requirements Analysis St,1)porling System for 
Htlulall Engineer. 111 Proceedings of Conference m~ 
IEEE- SMC99. Vol 3, pp. 1013-1018. 

Lewis D.D. and Ringuette M. (t994) A comparison 
of '\['wo 1,earning Algorithms for Text categorizalion. 
In Proceeding of the 3 ''~ Ammal &,ml~osium o~ 
Document Attalysis and h{/brmation Retrieval. 

Lewis I).D., Schapire P,.E., Calhm J.P. and Papka 
P,.(1996) Training Algorilhms for IJnear Text 
Classifiers. In Proceedings of the 19" htter~tatiomtl 
Coqference on Research and Deveh)l)ment in 
h!/btwtation Retrieval (SIGIR'96), pp. 289-297. 

Li H. and Yamanishi K. (1997) Document 
Classification Using a Finite Mixture Model. The 
Association for Co,qmtatiomtl Littguistics, 
ACE '97. 

McCallum A. and Nigram K. (1998) A comparison 
of Event Models for Naive Bayes Text 
Classification. AAAI '98 workshop on Leanting for 
7kvt Categorization. 

Nigam K., McCallum A., Thrun S. and Mitchell T. 
(1998) Learning to Classify Text from Labeled and 
Unlabeled l)oeuments. In Proceedings of 15" 
National Conference on Artificial httelligence 
(AAAI-98). 

Yang Y. (1999) An ewduation of statistical 
approaches to text categol"ization, ht.formation 
Retrieval Journal, May. 

Yaug Y. (1994) Expert netword: Effective and 
efficient learning fi'om human decisions in text 
catego,izatin and retriewfl. In 17" Ammal 
lnternatiomd A CM SIG1R Conference on Research 
and Development in hi formation Retrieval 
(SIGIR'94), pp. 13-22. 

Yang Y. and Pederson J.O. (1997) A comparative 
study on feature selection in text categorization, ht 
Proceedings of the 14" International Conference on 
Machine Learning. 
