<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1061">
  <Title>Language Model-Based Document Clustering Using Random Walks</Title>
  <Section position="3" start_page="479" end_page="481" type="intro">
    <SectionTitle>
2 Generation Probabilities as Document Vectors
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="479" end_page="479" type="sub_section">
      <SectionTitle>
2.1 Language Models
</SectionTitle>
      <Paragraph position="0"> The language modeling approach to information retrieval was first introduced by Ponte and Croft (1998) as an alternative (or an improvement) to the traditional a1a2 a3 a4a5a2 relevance models. In the language modeling framework, each document in the database defines a language model.</Paragraph>
      <Paragraph position="1"> The relevance of a document to a given query is ranked according to the generation probability of the query based on the underlying language model of the document. To induce a (unigram) language model from a document, we start with the maximum likelihood (ML) estimation of the term probabilities. For each term a18 that occurs in a document a19 , the ML estimation of a18 with respect to a19 is defined as</Paragraph>
      <Paragraph position="3"> where a25a26 a15a18 a16 a19 a17 is the number of occurences of term a18 in document a19 . This estimation is often smoothed based on the following general formula:</Paragraph>
      <Paragraph position="5"> where a20 a21 a22 a15a18 a23a37 a38a39a20 a40 a41a17 is the ML estimation of a18 over an entire corpus which usually a19 is a member of. a33 is the general smoothing parameter that takes different forms in various smoothing methods. Smoothing has two important roles (Zhai and Lafferty, 2004). First, it accounts for terms unseen in the document preventing zero probabilities. This is similar to the smoothing effect in NLP problems such as parsing. Second, smoothing has an a4a5a2 like effect that accounts for the generation probabilities of the common terms in the corpus. A common smoothing technique is to use Bayesian smoothing with the Dirichlet prior (Zhai and Lafferty, 2004; Liu and Croft, 2004):</Paragraph>
      <Paragraph position="7"> Here, a42 is the smoothing parameter. Higher values of a42 mean more aggressive smoothing.</Paragraph>
      <Paragraph position="8"> Assuming the terms in a text are independent from each other, the generation probability of a text sequence a43 given the document a19 is the product of the generation probabilities of the terms of a43 :</Paragraph>
      <Paragraph position="10"> In the context of information retrieval, a43 is a query usually composed of few terms. In this work, we are interested in the generation probabilities of entire documents that usually have in the order of hundreds of unique terms. If we use Equation 1, we end up having unnatural probabilities which are irrepresentably small and cause floating point underflow. More importantly, longer documents tend to have much smaller generation probabilities no matter how closely related they are to the generating language model. However, as we are interested in the generation probabilities between all pairs of documents, we want to be able to compare two different generation probabilities from a fixed language model regardless of the target document sizes. This is not a problem in the classical document retrieval setting since the given query is fixed, and generation probabilities for different queries are not compared against each other. To address these problems, following (Lavrenko et al., 2002; Kurland and Lee, 2005), we &amp;quot;flatten&amp;quot; the probabilities by normalizing them with respect to the document size:</Paragraph>
      <Paragraph position="12"> a23 is the number of terms in a43 . a20 flat provides us with meaningful values which are comparable among documents of different sizes.</Paragraph>
    </Section>
    <Section position="2" start_page="479" end_page="480" type="sub_section">
      <SectionTitle>
2.2 Using Generation Probabilities as Document Representations
</SectionTitle>
      <Paragraph position="0"> Equation 2 suggests a representation of the relationship of a document with the other documents in a corpus. Given a corpus of a0 documents to cluster, we form an a0-dimensional generation vector a49 a7a8 a24</Paragraph>
      <Paragraph position="2"> We can use these generation vectors in any clustering algorithm we prefer instead of the classical term-based a1a2 a3 a4a5a2 vectors. The intuition behind this idea becomes clearer when we consider the underlying directed graph representation, where each document is a node and the weight of the link from a5a9 to a5a13 is equal to a20 flat a15a5a9 a23a5a13 a17. An appropriate analogy here is the citation graph of scientific papers. The generation graph can be viewed as a model where documents cite each other. However, unlike real citations, the generation links are weighted and automatically induced from the content.</Paragraph>
      <Paragraph position="3"> The similarity function used in a clustering algorithm over the generation vectors becomes a measure of structural similarity of two nodes in the generation graph.</Paragraph>
      <Paragraph position="4"> Work on bibliometrics uses various similarity metrics to assess the relatedness of scientific papers by looking at the citation vectors (Boyack et al., 2005). Graph-based  similarity metrics are also used to detect semantic similarity of two documents on the Web (Maguitman et al., 2005). Cosine, also the standard metric used in a1a2 a3 a4a5a2 based document clustering, is one of these metrics. Intuitively, the cosine of the citation vectors (i.e. vector of outgoing link weights) of two nodes is high when they link to similar sets of nodes with similar link weights.</Paragraph>
      <Paragraph position="5"> Hence, the cosine of two generation vectors is a measure of how likely two documents are generated from the same documents' language models.</Paragraph>
      <Paragraph position="6"> The generation probability in Equation 2 with a smoothed language model is never zero. This creates two potential problems if we want to use the vector of Equation 3 directly in a clustering algorithm. First, we only want strong generation links to contribute in the similarity function since a low generation probability is not an evidence for semantic relatedness. This intuition is similar to throwing out the stopwords from the documents before constructing the a1a2 a3 a4a5a2 vectors to avoid coincidental similarities between documents. Second, having a dense vector with lots of non-zero elements will cause efficiency problems. Vector length is assumed to be a constant factor in analyzing the complexity of the clustering algorithms. However, our generation vectors are a0-dimensional, where a0 is the number of documents. In other words, vector size is not a constant factor anymore, which causes a problem of scalability to large data sets.</Paragraph>
      <Paragraph position="7"> To address these problems, we use what Kurland and Lee (2005) define as top generators: Given a document a5a9, we consider only a57 documents that yield the largest generation probabilities and discard others. The resultant a0-dimensional vector, denoted a49 a58 a7a8 , has at most a57 non-zero elements, which are the largest a57 elements of a49 a7a8. For a given constant a57, with a sparse vector representation, certain operations (e.g. cosine) on such vectors can be done in constant time independent of a0.</Paragraph>
    </Section>
    <Section position="3" start_page="480" end_page="481" type="sub_section">
      <SectionTitle>
2.3 Reinforcing Links with Random Walks
</SectionTitle>
      <Paragraph position="0"> Generation probabilities are only an approximation of semantic relatedness. Using the underlying directed graph interpretation of the generation probabilities, we aim to get better approximations by accumulating the generation link information in the graph. We start with some definitions. We denote a (directed) graph as a59 a15a60 a16 a18 a17 where a60 is the set of nodes and a18 a61 a60 a62 a60 a63 a64 is the link weight function. We formally define a generation graph as follows: Definition 1 Given a corpus a65 a24 a66a5 a51 a16 a5a52 a16 a53 a53 a53 a16 a5a54 a67 with a0 documents, and a constant a57, the generation graph of a65 is a directed graph a59</Paragraph>
      <Paragraph position="2"> is called the transition probability from node a40 to node a68.</Paragraph>
      <Paragraph position="3"> For example, for a generation graph a59  , there are at most a57 1-step random walks that start at a given node with probabilities proportional to the weights of the outgoing generation links of that node.</Paragraph>
      <Paragraph position="4"> Suppose there are three documents a84, a85 , and a37 in a generation graph. Suppose also that there are &amp;quot;strong&amp;quot; generation links from a84 to a85 and a85 to a37 , but no link from a84 to a37 . The intuition says that a84 must be semantically related to a37 to a certain degree although there is no generation link between them depending on a37 's language model. We approximate this relation by considering the probabilities of 2-step (or longer) random walks from a84 to a37 although there is no 1-step random walk from a84 to</Paragraph>
      <Paragraph position="6"> denote the probability that an a1-step random walk starts at a40 and ends at a68. An interesting property of random walks is that for a given node a68,</Paragraph>
      <Paragraph position="8"> does not depend on a40. In other words, the probability of a random walk ending up at a68 &amp;quot;in the long run&amp;quot; does not depend on its starting point (Seneta, 1981). This limiting probability distribution of an infinite random walk over the nodes is called the stationary distribution of the graph. The stationary distribution is uninteresting to us for clustering purposes since it gives an information related to the global structure of the graph. It is often used as a measure to rank the structural importance of the nodes in a graph (Brin and Page, 1998). For clustering, we are more interested in the local similarities inside a &amp;quot;cluster&amp;quot; of nodes that separate them from the rest of the graph. Furthermore, the generation probabilities lose their significance during long random walks since they get multiplied at each step. Therefore, we compute  a71 for small values of a1. Finally, we define the following: Definition 3 The a1-step generation probability of document a5a9 from the language model of a5a13 :</Paragraph>
      <Paragraph position="10"> the a1-step generation vector of document a5a9. We will often write a49 a89a90 a71 omitting the document name when we are not talking about the vector of a specific document.</Paragraph>
      <Paragraph position="12"> that starts at a5a9 will visit a5a13 in a1 or fewer steps. It helps us to discover &amp;quot;hidden&amp;quot; similarities between documents  that are not immediately obvious from 1-step generation links. Note that when a1 a24 a35, a49 a89a90 a51</Paragraph>
      <Paragraph position="14"> normalized such that the sum of the elements of the vector is 1. The two are practically the same representations since we compute the cosine of the vectors during clustering. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>