<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1061"> <Title>Language Model-Based Document Clustering Using Random Walks</Title> <Section position="4" start_page="481" end_page="481" type="relat"> <SectionTitle> 3 Related Work </SectionTitle>
<Paragraph position="0"> Our work is inspired by three main areas of research.</Paragraph>
<Paragraph position="1"> First, the success of language modeling approaches to information retrieval (Ponte and Croft, 1998) is encouraging for a similar twist to document representation for clustering purposes. Second, graph-based inference techniques that discover &quot;hidden&quot; textual relationships, like the one we explored in our random walk model, have been successfully applied to other NLP problems such as summarization (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Zha, 2002), prepositional phrase attachment (Toutanova et al., 2004), and word sense disambiguation (Mihalcea, 2005). Unlike our approach, these methods exploit the global structure of a graph to rank its nodes. For example, Erkan and Radev (2004) find the stationary distribution of a random walk on a graph of sentences to rank the sentences by salience for extractive summarization; their link weight function is based on cosine similarity. Our graph construction based on generation probabilities is inherited from Kurland and Lee (2005), where the authors used a similar generation graph to rerank the documents returned by a retrieval system based on the stationary distribution of the graph. Finally, previous research on clustering graphs with restricted random walks inspired us to cluster the generation graph using a similar approach. Our t-step random walk approach is similar to the one proposed by Harel and Koren (2001). However, their algorithm was proposed for &quot;spatial data&quot;, where the nodes of the graph are connected by undirected links determined by a (symmetric) similarity function. Our contribution in this paper is to apply their approach to textual data by using generation links, and to extend the method to directed graphs.</Paragraph>
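The stationary-distribution machinery mentioned above can be sketched in a few lines. The following is a minimal, illustrative sketch, not code from any of the cited systems; the function name, damping factor, and convergence settings are our own assumptions. The link weights could be cosine similarities between sentences, as in Erkan and Radev (2004), or generation probabilities between documents, as in Kurland and Lee (2005):

```python
import numpy as np

def stationary_distribution(weights, damping=0.85, tol=1e-8, max_iter=1000):
    """Stationary distribution of a random walk on a directed, weighted graph,
    computed by power iteration (illustrative sketch only; names and defaults
    are assumptions, not taken from the cited work).

    weights[i][j] >= 0 is the link weight from node i to node j, e.g. a cosine
    similarity between sentences or a generation probability between documents.
    """
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    # Row-normalize link weights into transition probabilities;
    # nodes with no outgoing links jump uniformly at random.
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, W / np.where(row_sums > 0, row_sums, 1.0), 1.0 / n)
    # A damping factor (as in PageRank) keeps the chain ergodic on directed graphs.
    P = damping * P + (1.0 - damping) / n
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = p @ P
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```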
<Paragraph position="2"> There is an extensive amount of research on document clustering, and on clustering algorithms in general, that we cannot possibly review here. After all, we do not present a new clustering algorithm, but rather a new representation of textual data. We explain some popular clustering algorithms and evaluate our representation using them in Section 4. Few methods have been proposed to cluster documents using a representation other than the traditional tf·idf vector space (or similar term-based vectors). One of them builds a bipartite graph of terms and documents and then clusters this graph using spectral methods (Dhillon, 2001; Zha et al., 2001). There are also general spectral methods that start with tf·idf vectors and map them to a lower-dimensional space before running the clustering algorithm (Ng et al., 2001).</Paragraph>
<Paragraph position="3"> The information-theoretic clustering algorithms are relevant to our framework in the sense that they involve probability distributions over words, just like language models. However, instead of looking at word distributions at the level of individual documents, they make use of the joint distribution of words and documents. For example, given the set of documents D and the set of words W in the document collection, Slonim and Tishby (2000) first try to find a word clustering W̃ such that the mutual information I(W; W̃) is minimized (for good compression) while I(W̃; D) is maximized (for preserving the original information). The same procedure is then used to cluster the documents using the word clusters from the first step. Dhillon et al. (2003) propose a co-clustering version of this information-theoretic method in which the words and the documents are clustered concurrently.</Paragraph> </Section> </Paper>