<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1144"> <Title>Concept Discovery from Text</Title> <Section position="3" start_page="0" end_page="2" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> Clustering algorithms are generally categorized as hierarchical and partitional. In hierarchical agglomerative algorithms, clusters are constructed by iteratively merging the most similar clusters. These algorithms differ in how they compute cluster similarity. In single-link clustering, the similarity between two clusters is the similarity between their most similar members while complete-link clustering uses the similarity between their least similar members.</Paragraph> <Paragraph position="1"> Average-link clustering computes this similarity as the average similarity between all pairs of elements across clusters. The complexity of these algorithms is O(n logn), where n is the number of elements to be clustered (Jain, Murty, Flynn 1999).</Paragraph> <Paragraph position="2"> Chameleon is a hierarchical algorithm that employs dynamic modeling to improve clustering quality (Karypis, Han, Kumar 1999). When merging two clusters, one might consider the sum of the similarities between pairs of elements across the clusters (e.g. average-link clustering). A drawback of this approach is that the existence of a single pair of very similar elements might unduly cause the merger of two clusters. An alternative considers the number of pairs of elements whose similarity exceeds a certain threshold (Guha, Rastogi, Kyuseok 1998). However, this may cause undesirable mergers when there are a large number of pairs whose similarities barely exceed the threshold. Chameleon clustering combines the two approaches.</Paragraph> <Paragraph position="3"> K-means clustering is often used on large data sets since its complexity is linear in n, the number of elements to be clustered. K-means is a family of partitional clustering algorithms that iteratively assigns each element to one of K clusters according to the centroid closest to it and recomputes the centroid of each cluster as the average of the clusters elements. K-means has complexity O(KxTxn) and is efficient for many clustering tasks. Because the initial centroids are randomly selected, the resulting clusters vary in quality. Some sets of initial centroids lead to poor convergence rates or poor cluster quality.</Paragraph> <Paragraph position="4"> Bisecting K-means (Steinbach, Karypis, Kumar 2000), a variation of K-means, begins with a set containing one large cluster consisting of every element and iteratively picks the largest cluster in the set, splits it into two clusters and replaces it by the split clusters. Splitting a cluster consists of applying the basic K-means algorithm a times with K=2 and keeping the split that has the highest average elementcentroid similarity.</Paragraph> <Paragraph position="5"> Hybrid clustering algorithms combine hierarchical and partitional algorithms in an attempt to have the high quality of hierarchical algorithms with the efficiency of partitional algorithms. Buckshot (Cutting, Karger, Pedersen, Tukey 1992) addresses the problem of randomly selecting initial centroids in K-means by combining it with average-link clustering.</Paragraph> <Paragraph position="6"> Buckshot first applies average-link to a random sample of n elements to generate K clusters. It then uses the centroids of the clusters as the initial K centroids of K-means clustering. 
</Section>
</Paper>