<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-3007">
  <Title>Document Representation and Multilevel Measures of Document Similarity</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Document indexing and the representation of term-document relations are crucial for document classification, clustering and retrieval. In the traditional bag-of-words vector space representation of documents (Salton and McGill, 1983), words correspond to orthogonal dimensions, which amounts to an unrealistic assumption that they are independent.</Paragraph>
    <Paragraph position="1"> Since document vectors are constructed in a very high-dimensional vocabulary space, there has been considerable interest in low-dimensional document representations that overcome the drawbacks of bag-of-words document vectors. Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is one of the best-known dimensionality reduction algorithms in information retrieval.</Paragraph>
    <Paragraph position="2"> In my research, I consider different notions of similarity between documents. I use dimensionality reduction and statistical co-occurrence information to define document representations that support these notions.</Paragraph>
  </Section>
</Paper>