Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 479–486,
New York, June 2006. ©2006 Association for Computational Linguistics
Language Model-Based Document Clustering Using Random Walks
Güneş Erkan
Department of EECS
University of Michigan
Ann Arbor, MI 48109-2121
gerkan@umich.edu
Abstract
We propose a new document vector representation specifically designed for the document clustering task. Instead of the traditional term-based vectors, a document is represented as an $n$-dimensional vector, where $n$ is the number of documents in the corpus to be clustered. The value at each dimension of the vector is closely related to the generation probability based on the language model of the corresponding document. Inspired by recent graph-based NLP methods, we reinforce the generation probabilities by iterating random walks on the underlying graph representation. Experiments with k-means and hierarchical clustering algorithms show significant improvements over the alternative tf·idf vector representation.
1 Introduction
Document clustering is one of the oldest and most studied
problems of information retrieval (van Rijsbergen, 1979).
Almost all document clustering approaches to date have
represented documents as vectors in a bag-of-words vec-
tor space model, where each dimension of a document
vector corresponds to a term in the corpus (Salton and
McGill, 1983). General clustering algorithms are then
applied to these vectors to cluster the given corpus. There
have been attempts to use bigrams or even higher-order n-
grams to represent documents in text categorization, the
supervised counterpart of document clustering, with little
success (Caropreso et al., 2001; Tan et al., 2002).
Clustering can be viewed as partitioning a set of data objects into groups such that the similarities between objects in the same group are high while inter-group similarities are weaker. The fundamental assumption in this
work is that the documents that are likely to have been
generated from similar language models are likely to be
in the same cluster. Under this assumption, we propose a
new representation for document vectors specifically de-
signed for clustering purposes.
Given a corpus, we are interested in the generation
probabilities of a document based on the language models
induced by other documents in the corpus. Using these
probabilities, we propose a vector representation where
each dimension of a document vector corresponds to a
document in the corpus instead of a term in the classical
representation. In other words, our document vectors are
$n$-dimensional, where $n$ is the number of documents in the corpus to be clustered. For the vector $\vec{v}_{d_i}$ of document $d_i$, the $j$th element of $\vec{v}_{d_i}$ is closely related to the generation probability of $d_i$ based on the language model induced by document $d_j$. The main steps of our method are as follows:
• For each ordered document pair $(d_i, d_j)$ in a given corpus, we compute the generation probability of $d_i$ from the language model induced by $d_j$, making use of language-model approaches in information retrieval (Ponte and Croft, 1998).

• We represent each document by a vector of its generation probabilities based on other documents' language models. At this point, these vectors can be used in any clustering algorithm instead of the traditional term-based document vectors.

• Following (Kurland and Lee, 2005), our new document vectors are used to construct the underlying generation graph; the directed graph where documents are the nodes and link weights are proportional to the generation probabilities.

• We use restricted random walk probabilities to reinforce the generation probabilities and discover hidden relationships in the graph that are not obvious from the generation links. Our random walk model is similar to the one proposed by Harel and Koren (2001) for general spatial data represented as undirected graphs. We have extended their model to the directed graph case. We use the new probabilities derived from random walks as the vector representation of the documents.
2 Generation Probabilities as Document
Vectors
2.1 Language Models
The language modeling approach to information retrieval
was first introduced by Ponte and Croft (1998) as an al-
ternative (or an improvement) to the traditional tf·idf
relevance models. In the language modeling framework,
each document in the database defines a language model.
The relevance of a document to a given query is ranked
according to the generation probability of the query based
on the underlying language model of the document. To
induce a (unigram) language model from a document, we
start with the maximum likelihood (ML) estimation of
the term probabilities. For each term $w$ that occurs in a document $d$, the ML estimation of $w$ with respect to $d$ is defined as
$$p_{\mathrm{ML}}(w|d) = \frac{\mathrm{tf}(w,d)}{\sum_{w' \in d} \mathrm{tf}(w',d)}$$
where $\mathrm{tf}(w,d)$ is the number of occurrences of term $w$ in document $d$. This estimation is often smoothed based on the following general formula:
$$p(w|d) = \lambda\, p_{\mathrm{ML}}(w|d) + (1 - \lambda)\, p_{\mathrm{ML}}(w|\mathrm{corpus})$$
where $p_{\mathrm{ML}}(w|\mathrm{corpus})$ is the ML estimation of $w$ over an entire corpus, which $d$ is usually a member of. $\lambda$ is the general smoothing parameter that takes different forms
in various smoothing methods. Smoothing has two im-
portant roles (Zhai and Lafferty, 2004). First, it accounts for terms unseen in the document, preventing zero probabilities. This is similar to the smoothing effect in NLP problems such as parsing. Second, smoothing has an idf-like effect that accounts for the generation probabilities of the common terms in the corpus. A common smoothing
technique is to use Bayesian smoothing with the Dirichlet
prior (Zhai and Lafferty, 2004; Liu and Croft, 2004):
$$\lambda = \frac{\sum_{w' \in d} \mathrm{tf}(w',d)}{\sum_{w' \in d} \mathrm{tf}(w',d) + \mu}$$
Here, $\mu$ is the smoothing parameter. Higher values of $\mu$ mean more aggressive smoothing.
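To make the smoothing concrete, here is a minimal Python sketch (an illustration added for this presentation, not part of the original system) of a Dirichlet-smoothed unigram model over tokenized documents; the function and variable names are assumptions.

```python
from collections import Counter

def dirichlet_lm(doc_terms, corpus_counts, corpus_total, mu=1000):
    """Return a function w -> p(w|d) under Dirichlet (Bayesian) smoothing.

    doc_terms     : list of (stemmed) tokens of document d
    corpus_counts : Counter of term frequencies over the whole corpus
    corpus_total  : total number of tokens in the corpus
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    lam = doc_len / (doc_len + mu)  # lambda induced by the Dirichlet prior

    def p(w):
        p_ml_doc = counts[w] / doc_len if doc_len else 0.0
        p_ml_corpus = corpus_counts[w] / corpus_total
        return lam * p_ml_doc + (1 - lam) * p_ml_corpus

    return p
```

Because of the corpus term, p(w) is non-zero for any term that occurs somewhere in the corpus, which keeps the generation probabilities below well defined.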
Assuming the terms in a text are independent of each other, the generation probability of a text sequence $S$ given the document $d$ is the product of the generation probabilities of the terms of $S$:
$$p(S|d) = \prod_{w \in S} p(w|d) \qquad (1)$$
In the context of information retrieval, $S$ is a query usually composed of a few terms. In this work, we are interested in the generation probabilities of entire documents, which usually have on the order of hundreds of unique terms. If we use Equation 1, we end up with probabilities that are vanishingly small and cause floating-point underflow. More importantly, longer docu-
ments tend to have much smaller generation probabilities
no matter how closely related they are to the generating
language model. However, as we are interested in the
generation probabilities between all pairs of documents,
we want to be able to compare two different generation
probabilities from a fixed language model regardless of
the target document sizes. This is not a problem in the
classical document retrieval setting since the given query
is fixed, and generation probabilities for different queries
are not compared against each other. To address these
problems, following (Lavrenko et al., 2002; Kurland and
Lee, 2005), we “flatten” the probabilities by normalizing
them with respect to the document size:
$$p_{\mathrm{flat}}(S|d) = p(S|d)^{1/|S|} \qquad (2)$$
where $|S|$ is the number of terms in $S$. $p_{\mathrm{flat}}$ provides us with meaningful values which are comparable among documents of different sizes.
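As a minimal sketch (assuming a smoothed model like the one above is available as a function p_w_given_d), the flattened probability of Equation 2 is conveniently computed in log space, which also avoids the underflow mentioned earlier:

```python
import math

def flat_generation_prob(target_terms, p_w_given_d):
    """p_flat(S|d) = p(S|d)^(1/|S|), computed in log space.

    Assumes p_w_given_d(w) > 0 for every term, which smoothing guarantees.
    """
    log_p = sum(math.log(p_w_given_d(w)) for w in target_terms)
    return math.exp(log_p / len(target_terms))
```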
2.2 Using Generation Probabilities as Document
Representations
Equation 2 suggests a representation of the relation-
ship of a document with the other documents in a
corpus. Given a corpus of $n$ documents to cluster, we form an $n$-dimensional generation vector $\vec{g}_{d_i} = (g_{d_i}^{1}, g_{d_i}^{2}, \ldots, g_{d_i}^{n})$ for each document $d_i$, where
$$g_{d_i}^{j} = \begin{cases} 0 & \text{if } i = j \\ p_{\mathrm{flat}}(d_i|d_j) & \text{otherwise} \end{cases} \qquad (3)$$
We can use these generation vectors in any clustering algorithm we prefer instead of the classical term-based tf·idf vectors. The intuition behind this idea becomes clearer when we consider the underlying directed graph representation, where each document is a node and the weight of the link from $d_i$ to $d_j$ is equal to $p_{\mathrm{flat}}(d_i|d_j)$.
An appropriate analogy here is the citation graph of sci-
entific papers. The generation graph can be viewed as a
model where documents cite each other. However, un-
like real citations, the generation links are weighted and
automatically induced from the content.
The similarity function used in a clustering algorithm
over the generation vectors becomes a measure of struc-
tural similarity of two nodes in the generation graph.
Work on bibliometrics uses various similarity metrics to
assess the relatedness of scientific papers by looking at
the citation vectors (Boyack et al., 2005). Graph-based
similarity metrics are also used to detect semantic simi-
larity of two documents on the Web (Maguitman et al.,
2005). Cosine, also the standard metric used in tf·idf-based document clustering, is one of these metrics. In-
tuitively, the cosine of the citation vectors (i.e. vector of
outgoing link weights) of two nodes is high when they
link to similar sets of nodes with similar link weights.
Hence, the cosine of two generation vectors is a measure
of how likely two documents are generated from the same
documents’ language models.
The generation probability in Equation 2 with a
smoothed language model is never zero. This creates two
potential problems if we want to use the vector of Equa-
tion 3 directly in a clustering algorithm. First, we only want strong generation links to contribute to the similarity function, since a low generation probability is not evidence of semantic relatedness. This intuition is similar to removing the stopwords from the documents before constructing the tf·idf vectors to avoid coincidental similarities between documents. Second, having
a dense vector with lots of non-zero elements will cause
efficiency problems. Vector length is assumed to be a
constant factor in analyzing the complexity of the clus-
tering algorithms. However, our generation vectors are
$n$-dimensional, where $n$ is the number of documents. In other words, vector size is not a constant factor anymore, which causes a problem of scalability to large data sets. To address these problems, we use what Kurland and Lee (2005) define as top generators: given a document $d_i$, we consider only the $c$ documents that yield the largest generation probabilities and discard the others. The resultant $n$-dimensional vector, denoted $\vec{g}^{\,c}_{d_i}$, has at most $c$ non-zero elements, which are the largest $c$ elements of $\vec{g}_{d_i}$. For a given constant $c$, with a sparse vector representation, certain operations (e.g. cosine) on such vectors can be done in constant time independent of $n$.
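The following sketch (with hypothetical helper names; flat_prob(i, j) is assumed to return $p_{\mathrm{flat}}(d_i|d_j)$) shows one way to build the sparse top-$c$ generation vectors and to compare them with cosine in time independent of $n$:

```python
def generation_vectors(docs, flat_prob, c=80):
    """Sparse generation vectors keeping only the top-c generators per document.

    Each vector is a dict mapping a generator index j to p_flat(d_i | d_j).
    """
    n = len(docs)
    vectors = []
    for i in range(n):
        scores = {j: flat_prob(i, j) for j in range(n) if j != i}
        top = sorted(scores, key=scores.get, reverse=True)[:c]
        vectors.append({j: scores[j] for j in top})
    return vectors

def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sum(x * x for x in u.values()) ** 0.5
    norm_v = sum(x * x for x in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```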
2.3 Reinforcing Links with Random Walks
Generation probabilities are only an approximation of se-
mantic relatedness. Using the underlying directed graph
interpretation of the generation probabilities, we aim to
get better approximations by accumulating the generation
link information in the graph. We start with some defini-
tions. We denote a (directed) graph as a59 a15a60 a16 a18 a17 where
a60 is the set of nodes and a18 a61 a60 a62 a60 a63 a64 is the link
weight function. We formally define a generation graph
as follows:
Definition 1 Given a corpus $C = \{d_1, d_2, \ldots, d_n\}$ with $n$ documents and a constant $c$, the generation graph of $C$ is the directed graph $G^{c}(C, w)$, where $w(d_i, d_j)$ is the $j$th element of $\vec{g}^{\,c}_{d_i}$.
Definition 2 A $t$-step random walk on a graph $G(V, w)$ that starts at node $v_0 \in V$ is a sequence of nodes $v_0, v_1, \ldots, v_t \in V$ where $w(v_i, v_{i+1}) > 0$ for all $0 \le i < t$. The probability of a $t$-step random walk is defined as $\prod_{i=0}^{t-1} p_{v_i v_{i+1}}$, where
$$p_{v_i v_{i+1}} = \frac{w(v_i, v_{i+1})}{\sum_{u \in V} w(v_i, u)}$$
$p_{uv}$ is called the transition probability from node $u$ to node $v$.
For example, for a generation graph $G^{c}$, there are at most $c$ 1-step random walks that start at a given node, with probabilities proportional to the weights of the outgoing generation links of that node.
Suppose there are three documents $A$, $B$, and $C$ in a generation graph. Suppose also that there are "strong" generation links from $A$ to $B$ and from $B$ to $C$, but no link from $A$ to $C$. Intuitively, $A$ must be semantically related to $C$ to a certain degree, although there is no generation link between them based on $C$'s language model. We approximate this relation by considering the probabilities of 2-step (or longer) random walks from $A$ to $C$, although there is no 1-step random walk from $A$ to $C$.
Let $p^{t}_{uv}$ denote the probability that a $t$-step random walk starts at $u$ and ends at $v$. An interesting property of random walks is that for a given node $v$, $p^{\infty}_{uv}$ does not depend on $u$. In other words, the probability of a random walk ending up at $v$ "in the long run" does not depend on its starting point (Seneta, 1981). This limiting probability distribution of an infinite random walk over the nodes is called the stationary distribution of the graph. The stationary distribution is uninteresting to us for clustering purposes since it gives information related to the global structure of the graph. It is often used as a measure to rank the structural importance of the nodes in a graph (Brin and Page, 1998). For clustering, we are more interested in the local similarities inside a "cluster" of nodes that separate them from the rest of the graph. Furthermore, the generation probabilities lose their significance during long random walks since they get multiplied at each step. Therefore, we compute $p^{t}$ only for small values of $t$. Finally, we define the following:
Definition 3 The $t$-step generation probability of document $d_i$ from the language model of $d_j$ is
$$\mathrm{gen}^{t}(d_i|d_j) = \frac{\sum_{m=1}^{t} p^{m}_{d_i d_j}}{t}$$
$\vec{\mathrm{gen}}^{\,t}_{d_i} = (\mathrm{gen}^{t}(d_i|d_1), \mathrm{gen}^{t}(d_i|d_2), \ldots, \mathrm{gen}^{t}(d_i|d_n))$ is the $t$-step generation vector of document $d_i$. We will often write $\vec{\mathrm{gen}}^{\,t}$, omitting the document name, when we are not talking about the vector of a specific document.
$\mathrm{gen}^{t}(d_i|d_j)$ is a measure of how likely a random walk that starts at $d_i$ is to visit $d_j$ in $t$ or fewer steps. It helps us to discover "hidden" similarities between documents that are not immediately obvious from 1-step generation links. Note that when $t = 1$, $\vec{\mathrm{gen}}^{\,1}_{d_i}$ is nothing but $\vec{g}^{\,c}_{d_i}$ normalized such that the sum of the elements of the vector is 1. The two are practically the same representation, since we compute the cosine of the vectors during clustering.
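Definition 3 can be evaluated directly through powers of the transition matrix. The sketch below is a dense-matrix illustration for moderate $n$ (the restricted-walk formulation of Harel and Koren avoids materializing full matrices); the weight matrix W is assumed to hold the top-$c$ generation links of Definition 1.

```python
import numpy as np

def t_step_generation_vectors(W, t=3):
    """gen^t vectors from a generation graph given as a weight matrix.

    W[i, j] holds p_flat(d_i | d_j) for the top-c generators of d_i and 0
    elsewhere (the diagonal is 0).  Row i of the result is gen^t of d_i.
    """
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    # Row-normalize to obtain the transition probabilities of Definition 2.
    P = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    acc = np.zeros_like(P)
    walk = np.eye(P.shape[0])
    for _ in range(t):
        walk = walk @ P   # walk[i, j] = probability of reaching j from i in exactly m steps
        acc += walk
    return acc / t        # average over walk lengths 1..t
```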
3 Related Work
Our work is inspired by three main areas of research.
First, the success of language modeling approaches to
information retrieval (Ponte and Croft, 1998) is encour-
aging for a similar twist to document representation for
clustering purposes. Second, graph-based inference tech-
niques to discover “hidden” textual relationships like the
one we explored in our random walk model have been
successfully applied to other NLP problems such as sum-
marization (Erkan and Radev, 2004; Mihalcea and Ta-
rau, 2004; Zha, 2002), prepositional phrase attachment
(Toutanova et al., 2004), and word sense disambiguation
(Mihalcea, 2005). Unlike our approach, these methods
try to exploit the global structure of a graph to rank the
nodes of the graph. For example, Erkan and Radev (2004)
find the stationary distribution of the random walk on a
graph of sentences to rank the salience scores of the sen-
tences for extractive summarization. Their link weight
function is based on cosine similarity. Our graph con-
struction based on generation probabilities is inherited
from (Kurland and Lee, 2005), where the authors used a sim-
ilar generation graph to rerank the documents returned
by a retrieval system based on the stationary distribu-
tion of the graph. Finally, previous research on clustering
graphs with restricted random walks inspired us to clus-
ter the generation graph using a similar approach. Our
a1-step random walk approach is similar to the one pro-
posed by Harel and Koren (2001). However, their algo-
rithm is proposed for “spatial data” where the nodes of
the graph are connected by undirected links that are de-
termined by a (symmetric) similarity function. Our con-
tribution in this paper is to use their approach on textual
data by using generation links, and extend the method to
directed graphs.
There is an extensive amount of research on document
clustering or clustering algorithms in general that we cannot possibly review here. After all, we do not present a
new clustering algorithm, but rather a new representation
of textual data. We explain some popular clustering algo-
rithms and evaluate our representation using them in Sec-
tion 4. Few methods have been proposed to cluster doc-
uments using a representation other than the traditional
tf·idf vector space (or similar term-based vectors). Us-
ing a bipartite graph of terms and documents and then
clustering this graph based on spectral methods is one of
them (Dhillon, 2001; Zha et al., 2001). There are also
general spectral methods that start with tf·idf vectors,
then map them to a new space with fewer dimensions be-
fore initiating the clustering algorithm (Ng et al., 2001).
The information-theoretic clustering algorithms are
relevant to our framework in the sense that they involve
probability distributions over words just like the language
models. However, instead of looking at the word distri-
butions at the individual document level, they make use
of the joint distribution of words and documents. For ex-
ample, given the set of documents $D$ and the set of words $W$ in the document collection, Slonim and Tishby (2000) first try to find a word clustering $\hat{W}$ such that the mutual information $I(W; \hat{W})$ is minimized (for good compression) while maximizing $I(\hat{W}; D)$ (for preserving the original information). Then the same procedure is used
for clustering documents using the word clusters from the
first step. Dhillon et al. (2003) propose a co-clustering
version of this information-theoretic method where they
cluster the words and the documents concurrently.
4 Evaluation
We evaluated our new vector representation by compar-
ing it against the traditional tf·idf vector space repre-
sentation. We ran k-means, single-link, average-link, and
complete-link clustering algorithms on various data sets
using both representations. These algorithms are among
the most popular ones that are used in document cluster-
ing.
4.1 General Experimental Setting
Given a corpus, we stemmed all the documents, removed
the stopwords, and constructed the tf·idf vector for each document by using the bow toolkit (McCallum, 1996). We computed the idf of each term using the following formula:
$$\mathrm{idf}(w) = \log_2\left(\frac{n}{\mathrm{df}(w)}\right)$$
where $n$ is the total number of documents and $\mathrm{df}(w)$ is the number of documents that the term $w$ appears in.
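As a small illustrative sketch of this baseline (the actual experiments use the bow toolkit; the function below only makes the formula concrete), tf·idf vectors can be built as sparse dictionaries:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf vectors with idf(w) = log2(n / df(w)).

    docs : list of tokenized (stemmed, stopword-filtered) documents.
    Each vector is a dict mapping a term to its tf*idf weight.
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequencies
    return [{w: tf * math.log2(n / df[w]) for w, tf in Counter(d).items()}
            for d in docs]
```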
We computed flattened generation probabilities (Equa-
tion 2) for all ordered pairs of documents in a corpus,
and then constructed the corresponding generation graph
(Definition 1). We used Dirichlet-smoothed language
models with the smoothing parameter $\mu = 1000$, which can be considered a typical value used in information retrieval. While computing the generation link vectors,
we did not perform extensive parameter tuning at any
stage of our method. However, we observed the follow-
ing:
• When $c$ (the number of outgoing links per document) was very small (less than 10), our methods performed poorly. This is expected with such a sparse vector representation for documents. However, the performance got rapidly and almost monotonically better as we increased $c$, until around $c = 80$, where the performance stabilized; it dropped after around $c = 100$. We conclude that using a bounded number of outgoing links per document is not only more efficient but also necessary, as we motivated in Section 2.2.
• We got the best results when the random walk parameter $t = 3$. When $t > 3$, the random walk goes "out of the cluster" and the $\vec{\mathrm{gen}}^{\,t}$ vectors become very dense. In other words, almost all of the graph is reachable from a given node with 4-step or longer random walks (assuming $c$ is around 80), which is an indication of a "small world" effect in generation graphs (Watts and Strogatz, 1998).

Under these observations, we will only report results using the vectors $\vec{\mathrm{gen}}^{\,1}$, $\vec{\mathrm{gen}}^{\,2}$, and $\vec{\mathrm{gen}}^{\,3}$ with $c = 80$, regardless of the data set and the clustering algorithm.
4.2 Experiments with k-means
4.2.1 Algorithm
k-means is a clustering algorithm popular for its sim-
plicity and efficiency. It requires $k$, the number of clusters, as input, and partitions the data set into exactly $k$ clusters. We used a version of k-means that uses cosine
similarity to compute the distance between the vectors.
The algorithm can be summarized as follows:
1. randomly select $k$ document vectors as the initial cluster centroids;
2. assign each document to the cluster whose centroid yields the highest cosine similarity;
3. recompute the centroid of each cluster (the centroid vector of a cluster is the average of the vectors in that cluster);
4. stop if none of the centroid vectors has changed at step 3; otherwise go to step 2.
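A minimal sketch of this cosine k-means variant, assuming documents are rows of a dense matrix and the seeds are indices of the initial centroid documents (tf·idf or generation vectors alike):

```python
import numpy as np

def cosine_kmeans(X, seeds, max_iter=100):
    """k-means with cosine similarity; seeds are row indices of the initial centroids."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # assumes no all-zero rows
    centroids = X[list(seeds)].copy()
    labels = None
    for _ in range(max_iter):
        centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        labels_new = (X @ centroids.T).argmax(axis=1)  # most similar centroid
        if labels is not None and np.array_equal(labels_new, labels):
            break                                      # assignments stable, so centroids are too
        labels = labels_new
        centroids = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                               else centroids[j] for j in range(len(seeds))])
    return labels
```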
4.2.2 Data
k-means is known to work better on data sets in which
the documents are nearly evenly distributed among dif-
ferent clusters. For this reason, we tried to pick such
corpora for this experiment to be able to get a fair com-
parison between different document representations. The
first corpus we used is classic3,1 which is a collection
of technical paper abstracts in three different areas. We
used two corpora, bbc and bbcsport, that are composed
1ftp://ftp.cs.cornell.edu/pub/smart
of BBC news articles in general and sports news, respec-
tively.2 Both corpora have 5 news classes each. 20news3
is a corpus of newsgroup articles composed of 20 classes.
Table 1 summarizes the corpora we used together with
the sizes of the smallest and largest class in each of them.
Corpus Documents Classes Smallest Largest
classic3 3891 3 1033 1460
bbcsport 737 5 100 265
bbc 2225 5 386 511
20news 18846 20 628 999
Table 1: The corpora used in the k-means experiments.
4.2.3 Results
We used two different metrics to evaluate the results
of the k-means algorithm: accuracy and mutual information. Let $l_i$ be the label assigned to $d_i$ by the clustering algorithm, and $\alpha_i$ be $d_i$'s actual label in the corpus. Then,
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{n} \delta(\mathrm{map}(l_i), \alpha_i)}{n}$$
where $\delta(x, y)$ equals 1 if $x = y$ and zero otherwise, and $\mathrm{map}(l_i)$ is the function that maps the output label set of the k-means algorithm to the actual label set of the corpus. Given the confusion matrix of the output, the best such mapping function can be found efficiently by Munkres's algorithm (Munkres, 1957).
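The mapping step can be implemented with an off-the-shelf assignment solver; a sketch (assuming SciPy's Hungarian-method implementation and parallel lists of predicted and true labels) follows:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian (Munkres) method

def clustering_accuracy(pred_labels, true_labels):
    """Accuracy after the best one-to-one mapping of output labels to classes."""
    pred_ids = {l: i for i, l in enumerate(sorted(set(pred_labels)))}
    true_ids = {l: i for i, l in enumerate(sorted(set(true_labels)))}
    confusion = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for p, t in zip(pred_labels, true_labels):
        confusion[pred_ids[p], true_ids[t]] += 1
    rows, cols = linear_sum_assignment(-confusion)    # maximize matched documents
    return confusion[rows, cols].sum() / len(true_labels)
```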
Mutual information is a metric that does not require a mapping function. Let $L = \{l_1, l_2, \ldots, l_k\}$ be the output label set of the k-means algorithm, and $A = \{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ be the actual label set of the corpus, with the underlying assignments of documents to these sets. Mutual information (MI) of these two labelings is defined as
$$\mathrm{MI}(L, A) = \sum_{l_i \in L,\, \alpha_j \in A} p(l_i, \alpha_j) \log_2 \frac{p(l_i, \alpha_j)}{p(l_i)\, p(\alpha_j)}$$
where $p(l_i)$ and $p(\alpha_j)$ are the probabilities that a document is labeled as $l_i$ and $\alpha_j$ by the algorithm and in the actual corpus, respectively; $p(l_i, \alpha_j)$ is the probability that these two events occur at the same time. These values can be derived from the confusion matrix. We map the MI metric to the $[0, 1]$ interval by normalizing it with the maximum possible MI that can be achieved with the corpus. Normalized MI is defined as
$$\widehat{\mathrm{MI}} = \frac{\mathrm{MI}(L, A)}{\mathrm{MI}(A, A)}$$
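A direct sketch of this normalized measure, estimated from the label assignments (note that $\mathrm{MI}(A, A)$ is simply the entropy of the true labeling):

```python
import numpy as np

def normalized_mi(pred_labels, true_labels):
    """MI(L, A) / MI(A, A), estimated from two parallel label lists."""
    pred, true = np.asarray(pred_labels), np.asarray(true_labels)

    def mi(a, b):
        total = 0.0
        for la in set(a.tolist()):
            for lb in set(b.tolist()):
                p_ab = np.mean((a == la) & (b == lb))
                if p_ab > 0:
                    total += p_ab * np.log2(p_ab / (np.mean(a == la) * np.mean(b == lb)))
        return total

    return mi(pred, true) / mi(true, true)
```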
2 http://www.cs.tcd.ie/Derek.Greene/research/datasets.html. The BBC corpora came in preprocessed format, so we did not perform the processing with the bow toolkit mentioned in Section 4.1.
3 http://people.csail.mit.edu/jrennie/20Newsgroups
One disadvantage of k-means is that its performance
is very dependent on the initial selection of cluster cen-
troids. Two approaches are usually used when reporting
the performance of k-means. The algorithm is run mul-
tiple times; then either the average performance of these
runs or the best performance achieved is reported. Re-
porting the best performance is not very realistic since
we would not be clustering a corpus if we already knew
the class labels. Reporting the average may not be very
informative since the variance of multiple runs is usually
large. We adopt an approach that is somewhere in be-
tween. We use “true seeds” to initialize k-means, that is,
we randomly select $k$ document vectors, one belonging to each of the true classes, as the initial centroids. This is
not an unrealistic assumption since we initially know the
number of classes, $k$, in the corpus, and the cost of find-
ing one example document from each class is not usually
high. This way, we also aim to reduce the variance of the
performance of different runs for a better analysis.
Table 2 shows the results of the k-means algorithm using tf·idf vectors versus the generation vectors $\vec{\mathrm{gen}}^{\,1}$ (plain flattened generation probabilities), $\vec{\mathrm{gen}}^{\,2}$ (2-step random walks), and $\vec{\mathrm{gen}}^{\,3}$ (3-step random walks). Taking advantage
of the relatively larger size and number of classes of
20news corpus, we randomly divided it into disjoint par-
titions with 4, 5, and 10 classes which provided us with
5, 4, and 2 new corpora, respectively. We named them
4news-1, 4news-2, ..., 10news-2 for clarity. We ran k-
means with 30 distinct initial seed sets for each corpus.
The first observation we draw from Table 2 is that even
$\vec{\mathrm{gen}}^{\,1}$ vectors perform better than the tf·idf model. This is particularly surprising given that $\vec{\mathrm{gen}}^{\,1}$ vectors are sparser than the tf·idf representation for most documents.4 All $\vec{\mathrm{gen}}^{\,t}$ vectors clearly outperform the tf·idf model, often by a wide margin. The performance also gets better (not always significantly, though) in almost all data sets as we increase the random walk length, which indicates that random walks are useful in reinforcing generation links and inducing new relationships. Another interesting observation is that the confidence intervals are also narrower for generation vectors, and tend to get even narrower as we increase $t$.
4.3 Experiments with Hierarchical Clustering
4.3.1 Algorithms
Hierarchical clustering algorithms start with the triv-
ial clustering of the corpus where each document de-
fines a separate cluster by itself. At each iteration, two
“most similar” separate clusters are merged. The algo-
rithm stops after $n - 1$ iterations, when all the documents are merged into a single cluster.
4 Remember that we set $c = 80$ in our experiments, which means that there can be a maximum of 80 non-zero elements in $\vec{\mathrm{gen}}^{\,1}$. Most documents have more than 80 unique terms in them.
Hierarchical clustering algorithms differ in how they
define the similarity between two clusters at each merg-
ing step. We experimented with three of the most popular
algorithms using cosine as the similarity metric between
two vectors. Single-link clustering merges two clusters
whose most similar members have the highest similarity.
Complete-link clustering merges two clusters whose least
similar members have the highest similarity. Average-link
clustering merges two clusters that yield the highest av-
erage similarity between all pairs of documents.
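For illustration, all three variants can be sketched with SciPy's agglomerative clustering over cosine distances (minimizing the average cosine distance is equivalent to maximizing the average cosine similarity, so the 'average' method corresponds to average-link clustering as described above); this is a sketch, not the authors' implementation:

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def build_cluster_tree(X, method="average"):
    """Agglomerative clustering over cosine distances.

    X      : (n_docs, dim) array of document vectors
    method : 'single', 'complete', or 'average'
    Returns the SciPy linkage matrix encoding the sequence of n-1 merges.
    """
    distances = pdist(X, metric="cosine")  # cosine distance = 1 - cosine similarity
    return linkage(distances, method=method)
```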
4.3.2 Data
Corpus Documents Classes Smallest Largest
Reuters 8646 57 2 3735
TDT2 10160 87 2 1843
Table 3: The corpora used in the hierarchical clustering exper-
iments.
Although hierarchical algorithms are not very efficient,
they are useful when the documents are not evenly dis-
tributed among the classes in the corpus and some classes
exhibit a “hierarchical” nature; that is, some classes in the
data might be semantically overlapping or they might be
in a subset/superset relation with each other. We picked
two corpora that may exhibit such nature to a certain ex-
tent. Reuters-215785 is a collection of news articles from
Reuters. TDT26 is a similar corpus of news articles col-
lected from six news agencies in 1998. They contain doc-
uments labeled with zero, one or more class labels. For
each corpus, we used only the documents with exactly
one label. We also eliminated classes with only one doc-
ument since clustering such classes is trivial. We ended
up with two collections summarized in Table 3.
4.3.3 Results
The output of a hierarchical clustering algorithm is a
tree where leaves are the documents and each node in the
tree shows a cluster merging operation. Therefore each
subtree represents a cluster. We assume that each class of
documents in the corpus forms a cluster subtree at some
point during the construction of the tree. To evaluate the
cluster tree, we use F-measure proposed in (Larsen and
Aone, 1999). The F-measure for a class $c_i$ in the corpus and a subtree $t_j$ is defined as
$$F(c_i, t_j) = \frac{2 \cdot R(c_i, t_j) \cdot P(c_i, t_j)}{R(c_i, t_j) + P(c_i, t_j)}$$
where $R(c_i, t_j)$ and $P(c_i, t_j)$ are the recall and the precision of $t_j$ considering the class $c_i$. Let $T$ be the set of subtrees in the output cluster tree, and $C$ be the set of classes. The F-measure of the entire tree is the weighted average of the maximum F-measures over all the classes:
$$F(C, T) = \sum_{c \in C} \frac{n_c}{n} \max_{t \in T} F(c, t)$$
where $n_c$ is the number of documents that belong to class $c$.
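A sketch of this tree F-measure, assuming the cluster tree has already been flattened into a list of document-index sets, one per subtree:

```python
from collections import Counter

def tree_f_measure(subtrees, labels):
    """Weighted tree F-measure.

    subtrees : list of sets of document indices, one set per subtree
    labels   : list giving the true class label of each document
    """
    n = len(labels)
    total = 0.0
    for cls, n_c in Counter(labels).items():
        members = {i for i, l in enumerate(labels) if l == cls}
        best = 0.0
        for tree in subtrees:
            overlap = len(members & tree)
            if overlap:
                recall, precision = overlap / n_c, overlap / len(tree)
                best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_c / n) * best
    return total
```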
5 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
6 http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
[Table 2: Performances of different vector representations using k-means (average of 30 runs ± 95% confidence interval). For each corpus (classic3 with k=3; 4news-1 through 4news-5 with k=4; 5news-1 through 5news-4 with k=5; bbc and bbcsport with k=5; 10news-1 and 10news-2 with k=10; 20news with k=20), the table reports accuracy (×100) and normalized mutual information (×100) for the tf·idf, gen^1, gen^2, and gen^3 vectors; the individual numeric entries are not legible in this copy.]
                  TDT2                                Reuters-21578
Algorithm         tf·idf  gen^1   gen^2   gen^3       tf·idf  gen^1   gen^2   gen^3
single-link       65.25   82.96   84.22   83.92       59.35   59.37   65.70   66.15
average-link      90.78   93.53   94.04   94.13       78.25   79.17   77.24   81.37
complete-link     29.07   25.04   27.19   34.67       43.66   42.79   45.91   48.36
Table 4: Performances (F-measure ×100) of different vector representations using hierarchical algorithms on two corpora.
We ran all three algorithms for both corpora. Unlike k-
means, hierarchical algorithms we used are deterministic.
Table 4 summarizes our results. An immediate observa-
tion is that average-link clustering performs much better than the other two algorithms, independent of the data set or the document representation, which is consistent with earlier research (Zhao and Karypis, 2002). The highest result for each algorithm and corpus was achieved by using generation vectors. However, unlike in the k-means experiments, tf·idf was able to outperform $\vec{\mathrm{gen}}^{\,1}$ and $\vec{\mathrm{gen}}^{\,2}$ in one or two cases. $\vec{\mathrm{gen}}^{\,2}$ yielded the best result instead of $\vec{\mathrm{gen}}^{\,3}$ in one of the six cases.
5 Conclusion
We have presented a language model inspired approach
to document clustering. Our results show that even the
simplest version of our approach with nearly no parame-
ter tuning can outperform traditional tf·idf models by a
wide margin. Random walk iterations on our graph-based
model have improved our results even more. Based on the
success of our model, we will investigate various graph-
based relationships for explaining semantic structure of
text collections in the future. Possible applications in-
clude information retrieval, text clustering/classification
and summarization.
Acknowledgments
I would like to thank Dragomir Radev for his useful com-
ments. This work was partially supported by the U.S.
National Science Foundation under the following two
grants: 0329043 “Probabilistic and link-based Methods
for Exploiting Very Large Textual Repositories” admin-
istered through the IDM program and 0308024 “Collab-
orative Research: Semantic Entity and Relation Extrac-
tion from Web-Scale Text Document Collections” admin-
istered by the HLT program. All opinions, findings, con-
clusions, and recommendations in this paper are made by
the authors and do not necessarily reflect the views of the
National Science Foundation.
References
Kevin W. Boyack, Richard Klavans, and Katy Börner.
2005. Mapping the backbone of science. Scientometrics,
64(3):351–374.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-
scale hypertextual web search engine. In Proceedings of the
7th International World Wide Web Conference, pages 107–
117.
Maria Fernanda Caropreso, Stan Matwin, and Fabrizio Sebas-
tiani. 2001. A learner-independent evaluation of the use-
fulness of statistical phrases for automated text categoriza-
tion. In Amita G. Chin, editor, Text Databases and Docu-
ment Management: Theory and Practice, pages 78–102. Idea
Group Publishing, Hershey, US.
Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S.
Modha. 2003. Information-theoretic co-clustering. In Pe-
dro Domingos, Christos Faloutsos, Ted Senator, Hillol Kar-
gupta, and Lise Getoor, editors, Proceedings of the ninth
ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining (KDD-03), pages 89–98, New York,
August 24–27. ACM Press.
Inderjit S. Dhillon. 2001. Co-clustering documents and words
using bipartite spectral graph partitioning. In Proceedings of
the Seventh ACM SIGKDD Conference, pages 269–274.
Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-
based lexical centrality as salience in text summarization.
Journal of Artificial Intelligence Research, 22:457–479.
David Harel and Yehuda Koren. 2001. Clustering spatial data
using random walks. In Proceedings of the Seventh ACM
SIGKDD Conference, pages 281–286, New York, NY, USA.
ACM Press.
Oren Kurland and Lillian Lee. 2005. PageRank without hyper-
links: Structural re-ranking using links induced by language
models. In Proceedings of SIGIR.
Bjornar Larsen and Chinatsu Aone. 1999. Fast and effective
text mining using linear-time document clustering. In KDD
’99: Proceedings of the fifth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages
16–22, New York, NY, USA. ACM Press.
Victor Lavrenko, James Allan, Edward DeGuzman, Daniel
LaFlamme, Veera Pollard, and Stephen Thomas. 2002. Rel-
evance models for topic detection and tracking. In Proceed-
ings of HLT, pages 104–110.
Xiaoyong Liu and W. Bruce Croft. 2004. Cluster-based re-
trieval using language models. In Proceedings of SIGIR,
pages 186–193.
Ana G. Maguitman, Filippo Menczer, Heather Roinestad, and
Alessandro Vespignani. 2005. Algorithmic detection of se-
mantic similarity. In WWW ’05: Proceedings of the 14th in-
ternational conference on World Wide Web, pages 107–116,
New York, NY, USA. ACM Press.
Andrew Kachites McCallum. 1996. Bow: A toolkit for sta-
tistical language modeling, text retrieval, classification and
clustering. http://www.cs.cmu.edu/~mccallum/bow.
Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing or-
der into texts. In Dekang Lin and Dekai Wu, editors, Pro-
ceedings of EMNLP 2004, pages 404–411, Barcelona, Spain,
July. Association for Computational Linguistics.
Rada Mihalcea. 2005. Unsupervised large-vocabulary word
sense disambiguation with graph-based algorithms for se-
quence data labeling. In Proceedings of Human Language
Technology Conference and Conference on Empirical Meth-
ods in Natural Language Processing, pages 411–418, Van-
couver, British Columbia, Canada, October. Association for
Computational Linguistics.
James Munkres. 1957. Algorithms for the assignment and
transportation problems. Journal of the Society for Indus-
trial and Applied Mathematics, 5(1):32–38, March.
Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. 2001. On
spectral clustering: Analysis and an algorithm. In NIPS,
pages 849–856.
Jay M. Ponte and W. Bruce Croft. 1998. A language modeling
approach to information retrieval. In Proceedings of SIGIR,
pages 275–281.
G. Salton and M. J. McGill. 1983. Introduction to Modern
Information Retrieval. McGraw Hill.
E. Seneta. 1981. Non-negative Matrices and Markov Chains.
Springer-Verlag, New York.
Noam Slonim and Naftali Tishby. 2000. Document clustering
using word clusters via the information bottleneck method.
In SIGIR, pages 208–215.
Chade-Meng Tan, Yuan-Fang Wang, and Chan-Do Lee. 2002.
The use of bigrams to enhance text categorization. Inf. Pro-
cess. Manage, 38(4):529–546.
Kristina Toutanova, Christopher D. Manning, and Andrew Y.
Ng. 2004. Learning random walk models for inducing word
dependency distributions. In ICML ’04: Proceedings of the
twenty-first international conference on Machine learning,
page 103, New York, NY, USA. ACM Press.
Cornelis J. van Rijsbergen. 1979. Information Retrieval. But-
terworths.
Duncan J. Watts and Steven H. Strogatz. 1998. Collective dy-
namics of small-world networks. Nature, 393(6684):440–
442, June 4.
Hongyuan Zha, Xiaofeng He, Chris H. Q. Ding, Ming Gu, and
Horst D. Simon. 2001. Bipartite graph partitioning and data
clustering. In Proceedings of CIKM, pages 25–32.
Hongyuan Zha. 2002. Generic summarization and key phrase extraction using mutual reinforcement principle and sentence clustering. In Proceedings of SIGIR, Tampere, Finland.
Chengxiang Zhai and John Lafferty. 2004. A study of smooth-
ing methods for language models applied to information re-
trieval. ACM Trans. Inf. Syst. (TOIS), 22(2):179–214.
Ying Zhao and George Karypis. 2002. Evaluation of hierarchi-
cal clustering algorithms for document datasets. In CIKM
’02: Proceedings of the eleventh international conference
on Information and knowledge management, pages 515–524,
New York, NY, USA. ACM Press.
