Workshop on TextGraphs, at HLT-NAACL 2006, pages 61–64,
New York City, June 2006. c©2006 Association for Computational Linguistics
Graph-based Generalized Latent Semantic Analysis
for Document Representation
Irina Matveeva
Dept. of Computer Science
University of Chicago
Chicago, IL 60637
matveeva@cs.uchicago.edu
Gina-Anne Levow
Dept. of Computer Science
University of Chicago
Chicago, IL 60637
levow@cs.uchicago.edu
Abstract
Document indexing and representation of
term-document relations are very impor-
tant for document clustering and retrieval.
In this paper, we combine a graph-based
dimensionality reduction method with a
corpus-based association measure within
the Generalized Latent Semantic Analysis
framework. We evaluate the graph-based
GLSA on the document clustering task.
1 Introduction
Document indexing and representation of term-
document relations are very important issues for
document clustering and retrieval. Although the
vocabulary space is very large, content bearing
words are often combined into semantic classes that
contain synonyms and semantically related words.
Hence there has been a considerable interest in low-
dimensional term and document representations.
Latent Semantic Analysis (LSA) (Deerwester et
al., 1990) is one of the best known dimensionality
reduction algorithms. The dimensions of the LSA
vector space can be interpreted as latent semantic
concepts. The cosine similarity between the LSA
document vectors corresponds to documents’ sim-
ilarity in the input space. LSA preserves the docu-
ments similarities which are based on the inner prod-
ucts of the input bag-of-word documents and it pre-
serves these similarities globally.
More recently, a number of graph-based dimen-
sionality reduction techniques were successfully ap-
plied to document clustering and retrieval (Belkin
and Niyogi, 2003; He et al., 2004). The main ad-
vantage of the graph-based approaches over LSA is
the notion of locality. Laplacian Eigenmaps Embed-
ding (Belkin and Niyogi, 2003) and Locality Pre-
serving Indexing (LPI) (He et al., 2004) discover the
local structure of the term and document space and
compute a semantic subspace with a stronger dis-
criminative power. Laplacian Eigenmaps Embed-
ding and LPI preserve the input similarities only
locally, because this information is most reliable.
Laplacian Eigenmaps Embedding does not provide
a fold-in procedure for unseen documents. LPI
is a linear approximation to Laplacian Eigenmaps
Embedding that eliminates this problem. Similar
to LSA, the input similarities to LPI are based on
the inner products of the bag-of-word documents.
Laplacian Eigenmaps Embedding can use any kind
of similarity in the original space.
Generalized Latent Semantic Analysis
(GLSA) (Matveeva et al., 2005) is a frame-
work for computing semantically motivated term
and document vectors. It extends the LSA approach
by focusing on term vectors instead of the dual
document-term representation. GLSA requires a
measure of semantic association between terms and
a method of dimensionality reduction.
In this paper, we use GLSA with point-wise mu-
tual information as a term association measure. We
introduce the notion of locality into this framework
and propose to use Laplacian Eigenmaps Embed-
ding as a dimensionality reduction algorithm. We
evaluate the importance of locality for document
representation in document clustering experiments.
The rest of the paper is organized as follows. Sec-
61
tion 2 contains the outline of the graph-based GLSA
algorithm. Section 3 presents our experiments, fol-
lowed by conclusion in section 4.
2 Graph-based GLSA
2.1 GLSA Framework
The GLSA algorithm (Matveeva et al., 2005) has the
following setup. The input is a document collection
C with vocabulary V and a large corpus W.
1. For the vocabulary in V , obtain a matrix of
pair-wise similarities, S, using the corpus W
2. Obtain the matrix UT of a low dimensional
vector space representation of terms that pre-
serves the similarities in S, UT ∈ Rk×|V |
3. Construct the term document matrix D for C
4. Compute document vectors by taking linear
combinations of term vectors ˆD = UTD
The columns of ˆD are documents in the k-
dimensional space.
GLSA approach can combine any kind of simi-
larity measure on the space of terms with any suit-
able method of dimensionality reduction. The inner
product between the term and document vectors in
the GLSA space preserves the semantic association
in the input space. The traditional term-document
matrix is used in the last step to provide the weights
in the linear combination of term vectors. LSA is
a special case of GLSA that uses inner product in
step 1 and singular value decomposition in step 2,
see (Bartell et al., 1992).
2.2 Singular Value Decomposition
Given any matrix S, its singular value decompo-
sition (SVD) is S = UΣV T . The matrix Sk =
UΣkV T is obtained by setting all but the first k di-
agonal elements in Σ to zero. If S is symmetric, as
in the GLSA case, U = V and Sk = UΣkUT . The
inner product between the GLSA term vectors com-
puted as UΣ1/2k optimally preserves the similarities
in S wrt square loss.
The basic GLSA computes the SVD of S and uses
k eigenvectors corresponding to the largest eigenval-
ues as a representation for term vectors. We will re-
fer to this approach as GLSA. As for LSA, the simi-
larities are preserved globally.
2.3 Laplacian Eigenmaps Embedding
We used the Laplacian Embedding algo-
rithm (Belkin and Niyogi, 2003) in step 2 of
the GLSA algorithm to compute low-dimensional
term vectors. Laplacian Eigenmaps Embedding
preserves the similarities in S only locally since
local information is often more reliable. We will
refer to this variant of GLSA as GLSAL.
The Laplacian Eigenmaps Embedding algorithm
computes the low dimensional vectors y to minimize
under certain constraints
summationdisplay
ij
||yi −yj||2Wij.
W is the weight matrix based on the graph adjacency
matrix. Wij is large if terms i and j are similar ac-
cording to S. Wij can be interpreted as the penalty
of mapping similar terms far apart in the Laplacian
Embedding space, see (Belkin and Niyogi, 2003)
for details. In our experiments we used a binary ad-
jacency matrix W. Wij = 1 if terms i and j are
among the k nearest neighbors of each other and is
zero otherwise.
2.4 Measure of Semantic Association
Following (Matveeva et al., 2005), we primarily
used point-wise mutual information (PMI) as a mea-
sures of semantic association in step 1 of GLSA.
PMI between random variables representing two
words, w1 and w2, is computed as
PMI(w1,w2) = log P(W1 = 1,W2 = 1)P(W
1 = 1)P(W2 = 1)
.
2.5 GLSA Space
GLSA offers a greater flexibility in exploring the
notion of semantic relatedness between terms. In
our preliminary experiments, we obtained the matrix
of semantic associations in step 1 of GLSA using
point-wise mutual information (PMI), likelihood ra-
tio and χ2 test. Although PMI showed the best per-
formance, other measures are particularly interest-
ing in combination with the Laplacian Embedding.
Related approaches, such as LSA, the Word Space
Model (WS) (Sch¨utze, 1998) and Latent Relational
Analysis (LRA) (Turney, 2004) are limited to only
one measure of semantic association and preserve
the similarities globally.
62
Assuming that the vocabulary space has some un-
derlying low dimensional semantic manifold. Lapla-
cian Embedding algorithm tries to approximate this
manifold by relying only on the local similarity in-
formation. It uses the nearest neighbors graph con-
structed using the pair-wise term similarities. The
computations of the Laplacian Embedding uses the
graph adjacency matrix W. This matrix can be bi-
nary or use weighted similarities. The advantage
of the binary adjacency matrix is that it conveys
the neighborhood information without relying on in-
dividual similarity values. It is important for co-
occurrence based similarity measures, see discus-
sion in (Manning and Sch¨utze, 1999).
The Locality Preserving Indexing (He et al.,
2004) has a similar notion of locality but has to use
bag-of-words document vectors.
3 Document Clustering Experiments
We conducted a document clustering experiment for
the Reuters-21578 collection. To collect the co-
occurrence statistics for the similarities matrix S
we used a subset of the English Gigaword collec-
tion (LDC), containing New York Times articles la-
beled as “story”. We had 1,119,364 documents with
771,451 terms. We used the Lemur toolkit1 to tok-
enize and index all document collections used in our
experiments, with stemming and a list of stop words.
Since Locality Preserving Indexing algorithm
(LPI) is most related to the graph-based GLSAL, we
ran experiments similar to those reported in (He et
al., 2004). We computed the GLSA document vec-
tors for the 20 largest categories from the Reuters-
21578 document collection. We had 8564 docu-
ments and 7173 terms. We used the same list of 30
TREC words as in (He et al., 2004) which are listed
in table 12. For each word on this list, we generated
a cluster as a subset of Reuters documents that con-
tained this word. Clusters are not disjoint and con-
tain documents from different Reuters categories.
We computed GLSA, GLSAL, LSA and LPI rep-
resentations. We report the results for k = 5 for
the k nearest neighbors graph for LPI and Laplacian
Embedding, and binary weights for the adjacency
1http://www.lemurproject.org/
2We used 28 words because we used stemming whereas (He
et al., 2004) did not, so that in two cases, two words were re-
duces to the same stem.
matrix. We report results for 300 embedding dimen-
sions for GLSA, LPI and LSA and 500 dimensions
for GLSAL.
We evaluate these representations in terms of how
well the cosine similarity between the document
vectors within each cluster corresponds to the true
semantic similarity. We expect documents from the
same Reuters category to have higher similarity.
For each cluster we computed all pair-wise doc-
ument similarities. All pair-wise similarities were
sorted in decreasing order. The term “inter-pair” de-
scribes a pair of documents that have the same label.
For the kth inter-pair, we computed precision at k as:
precision(pk) = #inter −pairs pj,s.t. j < kk ,
where pj refers to the jth inter-pair. The average
of the precision values for each of the inter-pairs was
used as the average precision for the particular doc-
ument cluster.
Table 1 summarizes the results. The first column
shows the words according to which document clus-
ters were generated and the entropy of the category
distribution within that cluster. The baseline was to
use the tf document vectors. We report results for
GLSA, GLSAL, LSA and LPI. The LSA and LPI
computations were based solely on the Reuters col-
lection. For GLSA and GLSALwe used the term as-
sociations computed for the Gigaword collection, as
described above. Therefore, the similarities that are
preserved are quite different. For LSA and LPI they
reflect the term distribution specific for the Reuters
collection whereas for GLSA they are more general.
By paired 2-tailed t-test, at p ≤ 0.05, GLSA outper-
formed all other approaches. There was no signifi-
cant difference in performance of GLSAL, LSA and
the baseline. Disappointingly, we could not achieve
good performance with LPI. Its performance varies
over clusters similar to that of other approaches but
the average is significantly lower. We would like
to stress that the comparison of our results to those
presented in (He et al., 2004) are only suggestive
since (He et al., 2004) applied LPI to each cluster
separately and used PCA as preprocessing. We com-
puted the LPI representation for the full collection
and did not use PCA.
63
word tf glsa glsaL lsa lpi
agreement(1) 0.74 0.73 0.73 0.75 0.46
american(0.8) 0.63 0.72 0.59 0.64 0.36
bank(1.4) 0.45 0.52 0.40 0.48 0.28
control(0.7) 0.78 0.82 0.80 0.80 0.58
domestic(0.8) 0.64 0.68 0.66 0.68 0.35
export(0.8) 0.64 0.65 0.70 0.67 0.37
five(1.3) 0.74 0.77 0.71 0.70 0.40
foreign(1.2) 0.51 0.58 0.55 0.56 0.28
growth(1) 0.51 0.58 0.48 0.54 0.32
income(0.5) 0.84 0.86 0.83 0.80 0.69
increase(1.3) 0.51 0.61 0.53 0.53 0.29
industrial(1.2) 0.59 0.66 0.58 0.61 0.34
internat.(1.1) 0.58 0.59 0.54 0.61 0.34
investment(1) 0.68 0.77 0.70 0.72 0.46
loss(0.3) 0.98 0.99 0.98 0.98 0.88
money(1.1) 0.70 0.62 0.71 0.65 0.38
national(1.3) 0.49 0.58 0.49 0.55 0.27
price(1.2) 0.53 0.63 0.57 0.57 0.29
production(1) 0.56 0.66 0.58 0.59 0.29
public(1.2) 0.58 0.60 0.57 0.57 0.31
rate(1.1) 0.61 0.62 0.64 0.60 0.35
report(1.2) 0.66 0.72 0.62 0.65 0.35
service(0.9) 0.59 0.66 0.56 0.61 0.39
source(1.2) 0.56 0.54 0.59 0.60 0.27
talk(0.9) 0.74 0.67 0.73 0.74 0.39
tax(0.7) 0.91 0.93 0.90 0.89 0.67
trade(1) 0.85 0.74 0.82 0.60 0.33
world(1.1) 0.63 0.65 0.68 0.66 0.33
Av. Acc 0.65 0.68 0.65 0.66 0.40
Table 1: Average inter-pairs accuracy.
The inter-pair accuracy depended on the cate-
gories distribution within clusters. For more homo-
geneous clusters, e.g. “loss”, all methods (except
LPI) achieve similar precision. For less homoge-
neous clusters, e.g. “national”, ”industrial”, ”bank”,
GLSA and LSA outperformed the tf document vec-
tors more significantly.
4 Conclusion and Future Work
We introduced a graph-based method of dimension-
ality reduction into the GLSA framework. Lapla-
cian Eigenmaps Embedding preserves the similar-
ities only locally, thus providing a potentially bet-
ter approximation to the low dimensional semantic
space. We explored the role of locality in the GLSA
representation and used binary adjacency matrix as
similarity which was preserved and compared it to
GLSA with unnormalized PMI scores.
Our results did not show an advantage of GLSAL.
GLSAL and LPI seem to be very sensitive to the pa-
rameters of the neighborhood graph. We tried dif-
ferent parameter settings but more experiments are
required for a thorough analysis. We are also plan-
ning to use a different document collection to elimi-
nate the possible effect of the specific term distribu-
tion in the Reuters collection. Further experiments
are needed to make conclusions about the geometry
of the vocabulary space and the appropriateness of
these methods for term and document embedding.

References
Brian T. Bartell, Garrison W. Cottrell, and Richard K.
Belew. 1992. Latent semantic indexing is an optimal
special case of multidimensional scaling. In Proc. of
the 15th ACM SIGIR, pages 161–167. ACM Press.
Mikhail Belkin and Partha Niyogi. 2003. Laplacian
eigenmaps for dimensionality reduction and data rep-
resentation. Neural Computation, 15(6):1373–1396.
Scott C. Deerwester, Susan T. Dumais, Thomas K. Lan-
dauer, George W. Furnas, and Richard A. Harshman.
1990. Indexing by latent semantic analysis. Jour-
nal of the American Society of Information Science,
41(6):391–407.
Xiaofei He, Deng Cai, Haifeng Liu, and Wei-Ying Ma.
2004. Locality preserving indexing for document rep-
resentation. In Proc. of the 27rd ACM SIGIR, pages
96–103. ACM Press.
Chris Manning and Hinrich Sch¨utze. 1999. Founda-
tions of Statistical Natural Language Processing. MIT
Press. Cambridge, MA.
Irina Matveeva, Gina-Anne Levow, Ayman Farahat, and
Christian Royer. 2005. Generalized latent semantic
analysis for term representation. In Proc. of RANLP.
Hinrich Sch¨utze. 1998. Automatic word sense discrimi-
nation. Computational Linguistics, 24(21):97–124.
Peter D. Turney. 2004. Human-level performance on
word analogy questions by latent relational analysis.
Technical report, Technical Report ERB-1118, NRC-
47422.
