<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1061">
  <Title>Language Model-Based Document Clustering Using Random Walks</Title>
  <Section position="5" start_page="481" end_page="483" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluated our new vector representation by comparing it against the traditional a1a2 a3 a4a5a2 vector space representation. We ran k-means, single-link, average-link, and complete-link clustering algorithms on various data sets using both representations. These algorithms are among the most popular ones that are used in document clustering. null</Paragraph>
    <Section position="1" start_page="481" end_page="482" type="sub_section">
      <SectionTitle>
4.1 General Experimental Setting
</SectionTitle>
      <Paragraph position="0"> Given a corpus, we stemmed all the documents, removed the stopwords and constructed the a1a2 a3a4a5a2 vector for each document by using the bow toolkit (McCallum, 1996).</Paragraph>
      <Paragraph position="1"> We computed the a4a5a2 of each term using the following formula:</Paragraph>
      <Paragraph position="3"> where a0 is the total number of documents and dfa15a18 a17 is the number of documents that the term a18 appears in.</Paragraph>
      <Paragraph position="4"> We computed flattened generation probabilities (Equation 2) for all ordered pairs of documents in a corpus, and then constructed the corresponding generation graph (Definition 1). We used Dirichlet-smoothed language models with the smoothing parameter a42 a24 a35a56a56a56, which can be considered as a typical value used in information retrieval. While computing the generation link vectors, we did not perform extensive parameter tuning at any stage of our method. However, we observed the following: null a14 When a57 (number of outgoing links per document) was very small (less than 10), our methods performed poorly. This is expected with such a sparse vector representation for documents. However, the performance got rapidly and almost monotonically  better as we increased a57 until around a57 a24 a98a56, where the performance stabilized and dropped after around  a35a56a56. We conclude that using bounded number of outgoing links per document is not only more efficient but also necessary as we motivated in Section 2.2.</Paragraph>
      <Paragraph position="5"> a14 We got the best results when the random walk parameter a1 a24 a99. When a1 a73 a99, the random walk goes &amp;quot;out of the cluster&amp;quot; and a49 a89a90a71 vectors become very dense. In other words, almost all of the graph is reachable from a given node with 4-step or longer random walks (assuming a57 is around 80), which is an indication of a &amp;quot;small world&amp;quot; effect in generation graphs (Watts and Strogatz, 1998).</Paragraph>
      <Paragraph position="6"> Under these observations, we will only report results using vectors a49 a89a90 a51, a49a89a90a52 and a49 a89a90 a100 with a57 a24 a98a56 regardless of the data set and the clustering algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="482" end_page="483" type="sub_section">
      <SectionTitle>
4.2 Experiments with k-means
</SectionTitle>
      <Paragraph position="0"> k-means is a clustering algorithm popular for its simplicity and efficiency. It requires a101, the number of clusters, as input, and partitions the data set into exactly a101 clusters. We used a version of k-means that uses cosine similarity to compute the distance between the vectors.</Paragraph>
      <Paragraph position="1"> The algorithm can be summarized as follows:  1. randomly select a101 document vectors as the initial cluster centroids; 2. assign each document to the cluster whose centroid yields the highest cosine similarity; 3. recompute the centroid of each cluster. (centroid vector of a cluster is the average of the vectors in that cluster); 4. stop if none of the centroid vectors has changed at step 3. otherwise go to step 2.</Paragraph>
      <Paragraph position="2">  k-means is known to work better on data sets in which the documents are nearly evenly distributed among different clusters. For this reason, we tried to pick such corpora for this experiment to be able to get a fair comparison between different document representations. The first corpus we used is classic3,1 which is a collection of technical paper abstracts in three different areas. We used two corpora, bbc and bbcsport, that are composed 1ftp://ftp.cs.cornell.edu/pub/smart of BBC news articles in general and sports news, respectively. 2 Both corpora have 5 news classes each. 20news3 is a corpus of newsgroup articles composed of 20 classes. Table 1 summarizes the corpora we used together with the sizes of the smallest and largest class in each of them.  We used two different metrics to evaluate the results of the k-means algorithm; accuracy and mutual information. Let a95a9 be the label assigned to a5a9 by the clustering algorithm, and a102 a9 be a5a9's actual label in the corpus. Then,</Paragraph>
      <Paragraph position="4"> a24 a105 and equals zero otherwise. mapa15a95a9 a17 is the function that maps the output label set of the k-means algorithm to the actual label set of the corpus. Given the confusion matrix of the output, best such mapping function can be efficiently found by Munkres's algorithm (Munkres, 1957).</Paragraph>
      <Paragraph position="5"> Mutual information is a metric that does not require a mapping function. Let a106</Paragraph>
      <Paragraph position="7"> a67 be the actual label set of the corpus with the underlying assignments of documents to these sets. Mutual information (MI) of these two labelings is defined as:</Paragraph>
      <Paragraph position="9"> a13 a17 are the probabilities that a document is labeled as a95a9 and a102a13 by the algorithm and in the actual corpus, respectively;</Paragraph>
      <Paragraph position="11"> that these two events occur at the same time. These values can be derived from the confusion matrix. We map the MI metric to the a114a56  a35a115 interval by normalizing it with the maximum possible MI that can be achieved with the corpus. Normalized MI is defined as</Paragraph>
      <Paragraph position="13"> research/datasets.html BBC corpora came in preprocessed format so that we did not perform the processing with the bow toolkit mentioned in Section 4.1  One disadvantage of k-means is that its performance is very dependent on the initial selection of cluster centroids. Two approaches are usually used when reporting the performance of k-means. The algorithm is run multiple times; then either the average performance of these runs or the best performance achieved is reported. Reporting the best performance is not very realistic since we would not be clustering a corpus if we already knew the class labels. Reporting the average may not be very informative since the variance of multiple runs is usually large. We adopt an approach that is somewhere in between. We use &amp;quot;true seeds&amp;quot; to initialize k-means, that is, we randomly select a101 document vectors that belong to each of the true classes as the initial centroids. This is not an unrealistic assumption since we initially know the number of classes, a101, in the corpus, and the cost of finding one example document from each class is not usually high. This way, we also aim to reduce the variance of the performance of different runs for a better analysis.</Paragraph>
      <Paragraph position="14"> Table 2 shows the results of k-means algorithm using a1a2 a3a4a5a2 vectors versus generation vectors a49a89a90 a51 (plain flattened generation probabilities), a49 a89a90 a52 (2-step random walks), a49a89a90 a100 (3-step random walks). Taking advantage of the relatively larger size and number of classes of 20news corpus, we randomly divided it into disjoint partitions with 4, 5, and 10 classes which provided us with 5, 4, and 2 new corpora, respectively. We named them 4news-1, 4news-2, a53 a53 a53, 10news-2 for clarity. We ran k-means with 30 distinct initial seed sets for each corpus.</Paragraph>
      <Paragraph position="15"> The first observation we draw from Table 2 is that even a49 a89a90 a51 vectors perform better than the a1a2a3a4a5a2 model. This is particularly surprising given that a49 a89a90 a51 vectors are sparser than the a1a2 a3a4a5a2 representation for most documents.4 All a49 a89a90 a71 vectors clearly outperform a1a2 a3a4a5a2 model often by a wide margin. The performance also gets better (not always significantly though) in almost all data sets as we increase the random walk length, which indicates that random walks are useful in reinforcing generation links and inducing new relationships. Another interesting observation is that the confidence intervals are also narrower for generation vectors, and tend to get even narrower as we increase a1.</Paragraph>
    </Section>
    <Section position="3" start_page="483" end_page="483" type="sub_section">
      <SectionTitle>
4.3 Experiments with Hierarchical Clustering
</SectionTitle>
      <Paragraph position="0"> Hierarchical clustering algorithms start with the trivial clustering of the corpus where each document defines a separate cluster by itself. At each iteration, two &amp;quot;most similar&amp;quot; separate clusters are merged. The algorithm stops after a0 a36 a35 iterations when all the documents 4Remember that we set a117 a118 a119a120 in our experiments which means that there can be a maximum of 80 non-zero elements in a121a122a123 a124. Most documents have more than 80 unique terms in them.</Paragraph>
      <Paragraph position="1"> are merged into a single cluster.</Paragraph>
      <Paragraph position="2"> Hierarchical clustering algorithms differ in how they define the similarity between two clusters at each merging step. We experimented with three of the most popular algorithms using cosine as the similarity metric between two vectors. Single-link clustering merges two clusters whose most similar members have the highest similarity.</Paragraph>
      <Paragraph position="3"> Complete-link clustering merges two clusters whose least similar members have the highest similarity. Average-link clustering merges two clusters that yield the highest average similarity between all pairs of documents.</Paragraph>
      <Paragraph position="4">  iments.</Paragraph>
      <Paragraph position="5"> Although hierarchical algorithms are not very efficient, they are useful when the documents are not evenly distributed among the classes in the corpus and some classes exhibit a &amp;quot;hierarchical&amp;quot; nature; that is, some classes in the data might be semantically overlapping or they might be in a subset/superset relation with each other. We picked two corpora that may exhibit such nature to a certain extent. Reuters-215785 is a collection of news articles from Reuters. TDT26 is a similar corpus of news articles collected from six news agencies in 1998. They contain documents labeled with zero, one or more class labels. For each corpus, we used only the documents with exactly one label. We also eliminated classes with only one document since clustering such classes is trivial. We ended up with two collections summarized in Table 3.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>