<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3810">
  <Title>Graph-based Generalized Latent Semantic Analysis for Document Representation</Title>
  <Section position="2" start_page="0" end_page="61" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Document indexing and representation of term-document relations are very important issues for document clustering and retrieval. Although the vocabulary space is very large, content bearing words are often combined into semantic classes that contain synonyms and semantically related words.</Paragraph>
    <Paragraph position="1"> Hence there has been a considerable interest in low-dimensional term and document representations.</Paragraph>
    <Paragraph position="2"> Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is one of the best known dimensionality reduction algorithms. The dimensions of the LSA vector space can be interpreted as latent semantic concepts. The cosine similarity between the LSA document vectors corresponds to documents' similarity in the input space. LSA preserves the documents similarities which are based on the inner products of the input bag-of-word documents and it preserves these similarities globally.</Paragraph>
    <Paragraph position="3"> More recently, a number of graph-based dimensionality reduction techniques were successfully applied to document clustering and retrieval (Belkin and Niyogi, 2003; He et al., 2004). The main advantage of the graph-based approaches over LSA is the notion of locality. Laplacian Eigenmaps Embedding (Belkin and Niyogi, 2003) and Locality Preserving Indexing (LPI) (He et al., 2004) discover the local structure of the term and document space and compute a semantic subspace with a stronger discriminative power. Laplacian Eigenmaps Embedding and LPI preserve the input similarities only locally, because this information is most reliable.</Paragraph>
    <Paragraph position="4"> Laplacian Eigenmaps Embedding does not provide a fold-in procedure for unseen documents. LPI is a linear approximation to Laplacian Eigenmaps Embedding that eliminates this problem. Similar to LSA, the input similarities to LPI are based on the inner products of the bag-of-word documents.</Paragraph>
    <Paragraph position="5"> Laplacian Eigenmaps Embedding can use any kind of similarity in the original space.</Paragraph>
    <Section position="1" start_page="0" end_page="61" type="sub_section">
      <SectionTitle>
Generalized Latent Semantic Analysis
</SectionTitle>
      <Paragraph position="0"> (GLSA) (Matveeva et al., 2005) is a framework for computing semantically motivated term and document vectors. It extends the LSA approach by focusing on term vectors instead of the dual document-term representation. GLSA requires a measure of semantic association between terms and a method of dimensionality reduction.</Paragraph>
      <Paragraph position="1"> In this paper, we use GLSA with point-wise mutual information as a term association measure. We introduce the notion of locality into this framework and propose to use Laplacian Eigenmaps Embedding as a dimensionality reduction algorithm. We evaluate the importance of locality for document representation in document clustering experiments.</Paragraph>
      <Paragraph position="2"> The rest of the paper is organized as follows. Sec- null tion 2 contains the outline of the graph-based GLSA algorithm. Section 3 presents our experiments, followed by conclusion in section 4.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>