Keyword-based Document Clustering 
 
Seung-Shik Kang 
School of Computer Science, Kookmin University & AITrc 
Chungnung-dong, Songbuk-gu, Seoul 136-702, Korea 
sskang@kookmin.ac.kr 
 
Abstract
Document clustering aggregates related documents 
into clusters by evaluating the similarity between 
documents and the representatives of clusters. 
Terms and their discriminating features are the 
clues for clustering, and these features are based 
on term and document frequencies. Feature selection 
based on frequency statistics limits how far a 
clustering algorithm can be improved, because it 
does not consider the contents of the objects being 
clustered. In this paper, we adopt a content-based 
analytic approach to refine the similarity 
computation and propose a keyword-based 
clustering algorithm. Experimental results show 
that content-based keyword weighting outperforms 
the frequency-based weighting method. 
Keywords: Document Clustering, Weighting 
Scheme, Feature Selection 
 
1 Introduction 
Document clustering aggregates documents by 
discriminating relevant documents from irrelevant 
ones. Whether any two documents are relevant to each 
other is determined by a similarity measure and by the 
representatives of the documents [1,2,3,4]. Common 
similarity measures include the Dice coefficient, 
Jaccard’s coefficient, and the cosine measure. These 
measures require that documents be represented as 
document vectors; the similarity of two documents is 
then calculated from operations on those vectors. 
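As an illustration of the three measures named above, the following is a minimal Python sketch that assumes each document vector is stored as a sparse {term: weight} dictionary. The function names, the dictionary representation, and the two sample vectors are assumptions made for this example; the weighted (vector) forms of the Dice and Jaccard coefficients are used.

```python
import math

def dot(a, b):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def dice(a, b):
    """Dice coefficient: 2(a.b) / (a.a + b.b)."""
    denom = dot(a, a) + dot(b, b)
    return 2.0 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    """Jaccard's coefficient: (a.b) / (a.a + b.b - a.b)."""
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

def cosine(a, b):
    """Cosine measure: (a.b) / (|a| |b|)."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

# Two hypothetical document vectors with unit weights.
d1 = {"cluster": 1.0, "keyword": 1.0}
d2 = {"cluster": 1.0, "weight": 1.0}

print(dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```

All three measures share the same inner product in the numerator and differ only in how they normalize it, so each one returns 0 for documents with no common term and 1 for identical vectors.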
This work was supported by the Korea Science and Engineering 
Foundation (KOSEF) through the Advanced Information 
Technology Research Center (AITrc). 
In general, the representative of a document or a 
cluster is a document vector that consists of <term, 
weight> pairs, and document similarities are 
determined by the terms and their weights that are 
extracted from the documents [7,9]. In previous 
studies on document clustering, the focus has been on 
the clustering algorithm, while the document 
representation methodology has not been treated as an 
important issue. Document vectors are simply 
constructed from the term frequency (TF) and the 
inverse document frequency (IDF). This weighting 
method presupposes that the terms or keywords that 
represent a document can be determined by TF-IDF. 
TF-IDF term weighting is generally used to construct a 
document vector, but we cannot say that it is the best 
way of representing a document. We therefore suppose 
that there is a limit to how much the accuracy of a 
clustering system can be improved by improving the 
clustering algorithm alone, without changing the 
document/cluster representation method. 
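To make the frequency-based baseline concrete, the following Python sketch builds TF-IDF document vectors. The paper does not state which TF-IDF variant it has in mind, so the formula used here, weight = tf × log(N / df), is an assumption, as are the function name and the toy token lists.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: weight = tf * log(N / df), where N is the number of
    documents and df is the number of documents that contain the term.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

# Hypothetical, already tokenized documents.
docs = [["keyword", "cluster", "cluster"], ["keyword", "weight"]]
vecs = tf_idf_vectors(docs)
print(vecs)
```

In this toy input the term "keyword" occurs in every document, so its weight is zero under this variant: the weight depends only on the counts and not on what role the term plays in the document.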
Document clustering also requires a large amount of 
memory to hold the representatives of 
documents/clusters and the similarity measures [6, 8, 
10]. Given N documents to be clustered, an N × N 
similarity matrix is needed to store the document 
similarities. In addition, repeatedly recalculating the 
similarities and reconstructing the representatives of 
the clusters requires a huge number of computations. 
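A rough sense of the memory cost can be obtained with a short calculation. The sketch below assumes 4 bytes per similarity value and a collection of 100,000 documents; both figures, and the function name, are illustrative assumptions rather than values from the paper.

```python
def similarity_matrix_bytes(n_docs, bytes_per_entry=4, symmetric=False):
    """Memory needed to store pairwise document similarities.

    symmetric=True stores only the pairs (i, j) with i < j,
    which assumes sim(i, j) == sim(j, i).
    """
    if symmetric:
        entries = n_docs * (n_docs - 1) // 2
    else:
        entries = n_docs * n_docs
    return entries * bytes_per_entry

full = similarity_matrix_bytes(100_000)
half = similarity_matrix_bytes(100_000, symmetric=True)
print(full / 1e9, "GB for the full matrix")
print(half / 1e9, "GB for the upper triangle only")
```

Under these assumptions the full matrix takes 40 GB, and exploiting symmetry only halves that figure, so the quadratic growth in N remains the limiting factor.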
In this paper, we propose a new clustering method 
that is based on a keyword weighting approach. The 
clustering algorithm starts from seed documents, and 
each cluster is expanded through keyword relationships. 
The growth of a cluster stops when no more documents 
are added to it and irrelevant documents have been 
removed from the cluster candidates. 
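The seed-and-expand idea outlined above can be sketched in Python as follows. This is not the algorithm of the paper, whose details are not given at this point; it is a minimal illustration that assumes each document is reduced to a set of keywords and that a candidate joins the cluster when it shares at least min_shared keywords with the cluster. The function name, the min_shared threshold, and the sample data are hypothetical.

```python
def expand_cluster(seed_id, doc_keywords, min_shared=2):
    """Grow one cluster from a seed document.

    doc_keywords: {doc_id: set of keywords} (assumed representation).
    A candidate is added when it shares at least min_shared keywords
    with the keywords collected from the cluster so far.
    """
    cluster = {seed_id}
    cluster_keywords = set(doc_keywords[seed_id])
    candidates = set(doc_keywords) - cluster
    changed = True
    while changed:  # stop when a full pass adds no document
        changed = False
        for doc_id in sorted(candidates):  # sorted() copies, so removal is safe
            if len(doc_keywords[doc_id] & cluster_keywords) >= min_shared:
                cluster.add(doc_id)
                cluster_keywords |= doc_keywords[doc_id]
                candidates.discard(doc_id)
                changed = True
    return cluster

# Hypothetical keyword sets.
docs = {
    "d1": {"cluster", "keyword", "weight"},
    "d2": {"cluster", "keyword", "seed"},
    "d3": {"seed", "keyword", "vector"},
    "d4": {"soccer", "league"},
}
print(sorted(expand_cluster("d1", docs)))
```

In this example d3 shares only one keyword with the seed d1, but it joins the cluster once d2 has contributed the keyword "seed"; d4 shares no keyword with the cluster and stays outside it.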
2 Keyword-based Weighting Scheme 
In general, the construction of a document vector 
depends on term frequency and document frequency. 
If keywords are determined from the frequency 
information of a document alone, errors are likely: 
nouns that are used often regardless of the substance 
of the document are extracted simply because of their 
high frequency. Clustering methods that focus on 
similarity calculation regard all words except 
stopwords as representatives of the document, and 
build a document vector whose weights are calculated 
from the term frequency and the document frequency. 
Commonly, a document is represented by terms and 
their weights, and <term, weight> pairs are the only 
elements of the document vector. When a document 
vector is constructed, term frequency and document 
frequency are the most important features for 
calculating the weight of a term. As for the terms and 

References

[1] Anderberg, M. R., Cluster Analysis for Applications, New York: Academic, 1973.

[2] Can, F., and E. A. Ozkarahan, Dynamic Cluster Maintenance, Information Processing & Management, Vol. 25, pp.275-291, 1989.

[3] Dubes, R., and A. K. Jain, Clustering Methodologies in Exploratory Data Analysis, Advances in Computers, Vol. 19, pp.113-227, 1980.

[4] Frakes, W. B., and R. Baeza-Yates, Information Retrieval, Prentice Hall, 1992.

[5] Kang, S. S., H. G. Lee, S. H. Son, G. C. Hong, and B. J. Moon, Term Weighting Method by Postposition and Compound Noun Recognition, Proceedings of 13th Conference on Korean Language Computing, pp.196-198, 2001.

[6] Murtagh, F., Complexities of Hierarchic Clustering Algorithms: State of the Art, Computational Statistics Quarterly, Vol. 1, pp.101-113, 1984.

[7] Perry, S. A., and P. Willett, A Review of the Use of Inverted Files for Best Match Searching in Information Retrieval Systems, Journal of Information Science, Vol. 6, pp.59-66, 1983.

[8] Sibson, R., SLINK: an Optimally Efficient Algorithm for the Single-Link Cluster Method, Computer Journal, Vol. 16, pp.328-342, 1973.

[9] Willett, P., Document Clustering Using an Inverted File Approach, Journal of Information Science, Vol. 2, pp.223-231, 1980.

[10] Willett, P., Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing & Management, Vol. 24, No.5, pp.577-597, 1988.
