<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1117"> <Title>Keyword-based Document Clustering</Title> <Section position="2" start_page="1" end_page="1" type="metho"> <SectionTitle> Keywords: Document Clustering, Weighting Scheme, Feature Selection 1 Introduction </SectionTitle> <Paragraph position="0"> Document clustering aggregates documents by discriminating relevant documents from irrelevant ones. The relevance of any two documents is determined by a similarity measure and by the representatives of the documents [1,2,3,4]. Common similarity measures include the Dice coefficient, Jaccard's coefficient, and the cosine measure. These measures require that documents be represented as document vectors; the similarity of two documents is then calculated by operations on those vectors.</Paragraph> <Paragraph position="1"> In general, the representatives of a document or a cluster are document vectors that consist of <term, weight> pairs, and document similarities are determined by the terms extracted from the documents and their weight values [7,9]. Previous studies on document clustering have focused on the clustering algorithm, but not on the document representation issue. Document vectors are simply constructed from the term frequency (TF) and the inverse document frequency (IDF). This term-weighting representation starts from the precondition that the terms or keywords representing a document can be calculated by TF-IDF. TF-IDF term weighting is generally used to construct a document vector, but we cannot say that it is the best way of representing a document. 
We therefore suppose that there is a limit to how far the accuracy of a clustering system can be improved by improving only the clustering algorithm, without changing the document/cluster representation method.</Paragraph> <Paragraph position="2"> In addition, document clustering requires a large amount of memory to keep the representatives of documents/clusters and the similarity measures [6, 8, 10]. Given N documents to be clustered, an N x N similarity matrix is needed to store the document similarity measures. Furthermore, the repeated iteration of similarity calculation and reconstruction of the cluster representatives requires a huge number of computations.</Paragraph> <Paragraph position="3"> In this paper, we propose a new clustering method that is based on the keyword weighting approach. The clustering algorithm starts from seed documents, and each cluster is expanded by the keyword relationship. The evolution of a cluster stops when no more documents are added to the cluster and irrelevant documents are removed from the cluster candidates.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Keyword-based Weighting Scheme </SectionTitle> <Paragraph position="0"> In general, the construction of a document vector depends on the term frequency and document frequency. If keywords are determined only by the frequency information of the document, we are apt to make the error of extracting high-frequency words, such as nouns that are used often regardless of the substance of the document. A clustering method that focuses on similarity calculation treats all words except stopwords as the representative of the document, and constructs a document vector whose weight values are calculated from the term frequency and document frequency.</Paragraph> <Paragraph position="1"> It is common for a document to be represented by terms and their weight values, and <term, weight> pairs are the unique elements of the document vector. 
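As a minimal sketch of the frequency-based representation described here, the following Python builds one vector of (term, weight) pairs per document and compares two vectors with the cosine measure. The weighting tf * log(N / df) is an assumption, since the text does not give a formula, and the function names tfidf_vectors and cosine and the toy corpus are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} vector per tokenized document.

    Assumed weighting (not specified in the text): tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine measure between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized with stopwords removed.
docs = [
    ["cluster", "document", "keyword"],
    ["cluster", "document", "vector"],
    ["weather", "forecast", "rain"],
]
vecs = tfidf_vectors(docs)
```

In this toy corpus the first two documents share terms, so their cosine is positive, while the third shares no term with the first and scores zero. Terms that occur in fewer documents, such as "keyword", receive a higher weight than terms shared across documents.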
When we construct a document vector, term frequency and document frequency are the most important features for calculating the weight of a term. As for the terms and</Paragraph> </Section> </Paper>