<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3810"> <Title>Graph-based Generalized Latent Semantic Analysis for Document Representation</Title> <Section position="3" start_page="61" end_page="62" type="metho"> <SectionTitle> 2 Graph-based GLSA 2.1 GLSA Framework </SectionTitle> <Paragraph position="0"> The GLSA algorithm (Matveeva et al., 2005) has the following setup. The input is a document collection C with vocabulary V and a large corpus W.</Paragraph> <Paragraph position="1"> 1. For the vocabulary in V , obtain a matrix of pair-wise similarities, S, using the corpus W 2. Obtain the matrix UT of a low dimensional vector space representation of terms that preserves the similarities in S, UT [?] Rkx|V | 3. Construct the term document matrix D for C 4. Compute document vectors by taking linear combinations of term vectors ^D = UTD The columns of ^D are documents in the k-dimensional space.</Paragraph> <Paragraph position="2"> GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction. The inner product between the term and document vectors in the GLSA space preserves the semantic association in the input space. The traditional term-document matrix is used in the last step to provide the weights in the linear combination of term vectors. LSA is a special case of GLSA that uses inner product in step 1 and singular value decomposition in step 2, see (Bartell et al., 1992).</Paragraph> <Section position="1" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 2.2 Singular Value Decomposition </SectionTitle> <Paragraph position="0"> Given any matrix S, its singular value decomposition (SVD) is S = USV T . The matrix Sk = USkV T is obtained by setting all but the first k diagonal elements in S to zero. If S is symmetric, as in the GLSA case, U = V and Sk = USkUT . The inner product between the GLSA term vectors computed as US1/2k optimally preserves the similarities in S wrt square loss.</Paragraph> <Paragraph position="1"> The basic GLSA computes the SVD of S and uses k eigenvectors corresponding to the largest eigenvalues as a representation for term vectors. We will refer to this approach as GLSA. As for LSA, the similarities are preserved globally.</Paragraph> </Section> <Section position="2" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 2.3 Laplacian Eigenmaps Embedding </SectionTitle> <Paragraph position="0"> We used the Laplacian Embedding algorithm (Belkin and Niyogi, 2003) in step 2 of the GLSA algorithm to compute low-dimensional term vectors. Laplacian Eigenmaps Embedding preserves the similarities in S only locally since local information is often more reliable. We will refer to this variant of GLSA as GLSAL.</Paragraph> <Paragraph position="1"> The Laplacian Eigenmaps Embedding algorithm computes the low dimensional vectors y to minimize under certain constraints summationdisplay ij ||yi [?]yj||2Wij.</Paragraph> <Paragraph position="2"> W is the weight matrix based on the graph adjacency matrix. Wij is large if terms i and j are similar according to S. Wij can be interpreted as the penalty of mapping similar terms far apart in the Laplacian Embedding space, see (Belkin and Niyogi, 2003) for details. In our experiments we used a binary adjacency matrix W. 
<Section position="3" start_page="61" end_page="61" type="sub_section"> <SectionTitle> 2.4 Measure of Semantic Association </SectionTitle> <Paragraph position="0"> Following (Matveeva et al., 2005), we primarily used point-wise mutual information (PMI) as the measure of semantic association in step 1 of GLSA.</Paragraph> <Paragraph position="1"> PMI between random variables representing two words, w1 and w2, is computed as PMI(w1, w2) = log [ P(W1 = 1, W2 = 1) / ( P(W1 = 1) P(W2 = 1) ) ].</Paragraph> </Section>
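As a companion to the PMI definition above, the following sketch shows one way to estimate the pair-wise association matrix S of step 1 from co-occurrence counts in a large corpus. The document-level counting scheme, the function name, and the choice to leave never-co-occurring pairs at zero are assumptions made for illustration rather than details taken from the paper.

import numpy as np
from collections import Counter
from itertools import combinations

def pmi_matrix(docs, vocab):
    """docs: iterable of token lists; vocab: list of terms to score. Returns S (|V| x |V|)."""
    index = {t: i for i, t in enumerate(vocab)}
    n_docs = 0
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing each term pair
    for tokens in docs:
        n_docs += 1
        present = sorted({index[t] for t in tokens if t in index})
        term_df.update(present)
        pair_df.update(combinations(present, 2))
    S = np.zeros((len(vocab), len(vocab)))
    for (i, j), n_ij in pair_df.items():
        p_ij = n_ij / n_docs
        p_i, p_j = term_df[i] / n_docs, term_df[j] / n_docs
        S[i, j] = S[j, i] = np.log(p_ij / (p_i * p_j))   # PMI(w_i, w_j)
    return S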
<Section position="4" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 2.5 GLSA Space </SectionTitle> <Paragraph position="0"> GLSA offers greater flexibility in exploring the notion of semantic relatedness between terms. In our preliminary experiments, we obtained the matrix of semantic associations in step 1 of GLSA using point-wise mutual information (PMI), the likelihood ratio and the χ² test. Although PMI showed the best performance, the other measures are particularly interesting in combination with the Laplacian Embedding.</Paragraph> <Paragraph position="1"> Related approaches, such as LSA, the Word Space Model (WS) (Schütze, 1998) and Latent Relational Analysis (LRA) (Turney, 2004), are limited to only one measure of semantic association and preserve the similarities globally.</Paragraph> <Paragraph position="2"> Assuming that the vocabulary space has some underlying low-dimensional semantic manifold, the Laplacian Embedding algorithm tries to approximate this manifold by relying only on local similarity information. It uses the nearest neighbors graph constructed from the pair-wise term similarities. The computation of the Laplacian Embedding uses the graph adjacency matrix W. This matrix can be binary or use weighted similarities. The advantage of the binary adjacency matrix is that it conveys the neighborhood information without relying on individual similarity values, which is important for co-occurrence-based similarity measures; see the discussion in (Manning and Schütze, 1999).</Paragraph> <Paragraph position="3"> Locality Preserving Indexing (He et al., 2004) has a similar notion of locality but has to use bag-of-words document vectors.</Paragraph> </Section> </Section> <Section position="4" start_page="62" end_page="63" type="metho"> <SectionTitle> 3 Document Clustering Experiments </SectionTitle> <Paragraph position="0"> We conducted a document clustering experiment on the Reuters-21578 collection. To collect the co-occurrence statistics for the similarity matrix S we used a subset of the English Gigaword collection (LDC) containing New York Times articles labeled as "story". This subset contained 1,119,364 documents with 771,451 terms. We used the Lemur toolkit to tokenize and index all document collections used in our experiments, with stemming and a list of stop words.</Paragraph> <Paragraph position="1"> Since the Locality Preserving Indexing (LPI) algorithm is most closely related to the graph-based GLSAL, we ran experiments similar to those reported in (He et al., 2004). We computed the GLSA document vectors for the 20 largest categories of the Reuters-21578 document collection. This gave 8,564 documents and 7,173 terms. We used the same list of 30 TREC words as in (He et al., 2004), which are listed in Table 1. (We used stemming, whereas (He et al., 2004) did not, so that in two cases two words were reduced to the same stem.) For each word on this list, we generated a cluster as a subset of Reuters documents that contained this word. The clusters are not disjoint and contain documents from different Reuters categories.</Paragraph> <Paragraph position="2"> We computed the GLSA, GLSAL, LSA and LPI representations. We report the results for k = 5 for the k nearest neighbors graph used for LPI and for the Laplacian Embedding adjacency matrix. We report results for 300 embedding dimensions for GLSA, LPI and LSA, and 500 dimensions for GLSAL.</Paragraph> <Paragraph position="4"> We evaluate these representations in terms of how well the cosine similarity between the document vectors within each cluster corresponds to the true semantic similarity. We expect documents from the same Reuters category to have higher similarity.</Paragraph> <Paragraph position="5"> For each cluster we computed all pair-wise document similarities and sorted them in decreasing order. The term "inter-pair" describes a pair of documents that have the same label. For the kth inter-pair, we computed precision at k as precision(p_k) = #{inter-pairs p_j, s.t. j < k} / k, where p_j refers to the jth inter-pair. The average of the precision values over all inter-pairs was used as the average precision for the particular document cluster (a code sketch of this evaluation is given after the results discussion below).</Paragraph> <Paragraph position="6"> Table 1 summarizes the results. The first column shows the words according to which the document clusters were generated, together with the entropy of the category distribution within each cluster. The baseline was to use the tf document vectors. We report results for GLSA, GLSAL, LSA and LPI. The LSA and LPI computations were based solely on the Reuters collection. For GLSA and GLSAL we used the term associations computed on the Gigaword collection, as described above. Therefore, the similarities that are preserved are quite different: for LSA and LPI they reflect the term distribution specific to the Reuters collection, whereas for GLSA they are more general.</Paragraph> <Paragraph position="7"> By a paired two-tailed t-test at p ≤ 0.05, GLSA outperformed all other approaches. There was no significant difference in the performance of GLSAL, LSA and the baseline. Disappointingly, we could not achieve good performance with LPI. Its performance varies over clusters similarly to that of the other approaches, but its average is significantly lower. We would like to stress that the comparison of our results to those presented in (He et al., 2004) is only suggestive, since (He et al., 2004) applied LPI to each cluster separately and used PCA as preprocessing. We computed the LPI representation for the full collection and did not use PCA.</Paragraph> <Paragraph position="8"> The inter-pair precision depended on the category distribution within the clusters. For more homogeneous clusters, e.g. "loss", all methods (except LPI) achieve similar precision. For less homogeneous clusters, e.g. "national", "industrial", "bank", GLSA and LSA outperformed the tf document vectors more significantly.</Paragraph> </Section> </Paper>
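For completeness, a minimal sketch of the inter-pair evaluation described in Section 3: rank all pair-wise cosine similarities within a cluster, compute precision at the rank of every inter-pair (here using the common convention of counting the inter-pair at its own rank), and average the values. The function name and input conventions are assumptions made for this sketch.

import numpy as np
from itertools import combinations

def average_inter_pair_precision(doc_vectors, labels):
    """doc_vectors: (n_docs x dim) array; labels: Reuters category label per document."""
    X = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    # All pair-wise cosine similarities, flagged as inter-pair when the labels match.
    pairs = [(float(X[i] @ X[j]), labels[i] == labels[j])
             for i, j in combinations(range(len(labels)), 2)]
    pairs.sort(key=lambda p: -p[0])              # decreasing similarity
    precisions, n_inter = [], 0
    for rank, (_, is_inter) in enumerate(pairs, start=1):
        if is_inter:
            n_inter += 1
            precisions.append(n_inter / rank)    # precision at this inter-pair's rank
    return float(np.mean(precisions)) if precisions else 0.0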