<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1045">
  <Title>A Method of Cluster-Based Indexing of Textual Data</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Background Issues
</SectionTitle>
    <Paragraph position="0"> A view from indexing In information retrieval research, matrix transformation-based indexing methods such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990) have recently become quite common. These methods can be viewed as an established basis for exposing hidden associations between documents and terms. However, their objective is to generate a compact representation of the original information space, and it is likely in consequence that the resulting orthogonal vectors are dense with many non-zero elements (Dhillon and Modha, 1999). In addition, because the reduction process is globally optimized, matrix transformation-based methods become computationally infeasible when dealing with high-dimensional data.</Paragraph>
    <Paragraph position="1"> A view from clustering The document-clustering problem has also been extensively studied in the past (Iwayama and Tokunaga, 1995; Steinbach et al., 2000).</Paragraph>
    <Paragraph position="2"> The majority of the previous approaches to clustering construct either a partition or a hierarchy of target documents, where the generated clusters are either exclusive or nested. However, generating mutually exclusive or tree-structured clusters in general is a hard-constrained problem and thus is likely to suffer high computational costs when dealing with large-scale data. Also, such a constraint is not necessarily required in actual applications, because 'topics' of documents, or rather 'indices' in our context, are arbitrarily overlapped in nature (Zamir and Etzioni, 1998).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Basic Strategy:
</SectionTitle>
      <Paragraph position="0"> Based on the above observations, our basic strategy is as follows: *Instead of generating component vectors with many non-zero elements, produce only limited subsets of elements, i.e., micro-clusters, with significance weights.</Paragraph>
      <Paragraph position="1"> *Instead of transforming the entire co-occurrence matrix into a different feature space, extract tightly associated sub-structures of the elements on the graphical representation of the matrix.</Paragraph>
      <Paragraph position="2"> *Use entropy-based criteria for cluster evaluation so that the sizes of the generated clusters can be determined independently of other existing clusters.</Paragraph>
      <Paragraph position="3"> *Allow the generated clusters to overlap with each other. By assuming that each element can be categorized into multiple clusters, we can reduce the problem to a feasible level where the clusters are processed individually.</Paragraph>
      <Paragraph position="4"> Related studies: Another important aspect of the proposed micro-clustering scheme is that the method employs simultaneous clustering of its composing elements. This not only enables us to combine issues in term indexing and document clustering, as mentioned above, but also is useful for connecting matrix-based and graph-based notions of clustering; the latter is based on the association networks of the elements extracted from the original co-occurrence matrices. Some recent topics dealing with this sort of duality and/or graphical views include: the Information Bottleneck Method (Slonim and Tishby, 2000), Conceptual Indexing (Dhillon and Modha, 1999; Karypis and Han, 2000), and Bipartite Spectral Graph Partitioning (Dhillon, 2001), although each of these follows its own mathematical formulation.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Clustering Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Definition of Micro-Clusters
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> The co-occurrences of terms and documents can be expressed as a matrix of size M xN in which the (i,j)-th cell indicates that t</Paragraph>
      <Paragraph position="4"> ([?] D). We make the value of the (i,j)-th cell equal to freq(t</Paragraph>
      <Paragraph position="6"> ). Although we primarily assume the value is either '1' (exist) or '0' (not exist) in this paper, our formulation could easily be extended to the cases where freq(t</Paragraph>
      <Paragraph position="8"> represents the actual number of times that t</Paragraph>
      <Paragraph position="10"> over all the documents in D is denoted as freq(t i ,D). Similarly, the observed total frequency of d  j , i.e. the total number of terms contained in d j , is denoted as freq(T,d j ). These values correspond  to summations of the columns and the rows of the co-occurrence matrix. The total frequency of all the documents is denoted as freq(T,D). Thus,</Paragraph>
      <Paragraph position="12"> When a cluster c is being considered, T and D in the above definitions are changed to S</Paragraph>
      <Paragraph position="14"> ), respectively. In the co-occurrence matrix, a cluster is expressed as a 'rectangular' region if terms and documents are so permuted (Figure 2).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Probabilistic Formulation
</SectionTitle>
      <Paragraph position="0"> The view of the co-occurrence matrix can be further extended by assigning probabilities to each cell. With the probabilistic formulation,</Paragraph>
      <Paragraph position="2"> are considered as independently observed events, and their combination as a single co-occurrence event (t</Paragraph>
      <Paragraph position="4"> ) is also considered as a single co-occurrence event of observing one of t</Paragraph>
      <Paragraph position="6"> In estimating the probability of each event, we use a simple discounting method similar to the absolute discounting in probabilistic language modeling studies (Baayen, 2001). The method subtracts a constant value d, called a discounting coefficient, from all the observed term frequencies and estimates the probability of t</Paragraph>
      <Paragraph position="8"> Note that the discounting effect is stronger for low-frequency terms. For high-frequency terms,</Paragraph>
      <Paragraph position="10"> )/F. In the original definition, the value of d was uniquely determined, for example as d =</Paragraph>
      <Paragraph position="12"> with m(1) being the number of terms that appear exactly once in the text. However, we experimentally vary the value of d in our study, because it is an essential factor for controlling the size and quality of the generated clusters.</Paragraph>
      <Paragraph position="13"> Assuming that the probabilities assigned to documents are not affected by the discounting,</Paragraph>
      <Paragraph position="15"> Similarly, the co-occurence probability of S</Paragraph>
      <Paragraph position="17"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Criteria for Cluster Evaluation
</SectionTitle>
      <Paragraph position="0"> The evaluation is based on the information theoretic view of the retrieval systems (Aizawa, 2000). Let T and D be two random variables corresponding to the events of observing a term and a document, respectively. Denote their occurrence probabilities as P(T ) and P(D), and their co-occurrence probability as a joint distribution P(T ,D). By the general definition of traditional information theory, the mutual information between T and D, denoted as</Paragraph>
      <Paragraph position="2"> )/F. Next, the mutual information after agglomerating S</Paragraph>
      <Paragraph position="4"> into a single cluster (Figure 2) is calculated as:</Paragraph>
      <Paragraph position="6"/>
      <Paragraph position="8"> Without discounting, the value of dI(S</Paragraph>
      <Paragraph position="10"> the above equation is always negative or zero.</Paragraph>
      <Paragraph position="11"> However, with discounting, the value becomes positive for uniformly dense clusters, because the frequencies of individual cells are always smaller than their agglomeration and so the discounting effect is stronger for the former.</Paragraph>
      <Paragraph position="12"> Using the same formula, we calculated the significance weights t</Paragraph>
      <Paragraph position="14"> and the significance weights of d</Paragraph>
      <Paragraph position="16"> In other words, all the terms and documents in a cluster can be jointly ordered according to their contribution in the entropy calculation given by Eq. (7).</Paragraph>
      <Paragraph position="17"> To summarize, the proposed probabilistic formulation has the following two major features. First, clustering is generally defined as an operation of agglomerating a group of cells in the contingency table. Such an interpretation is unique because existing probabilistic approaches, including those with a duality view, agglomerate entire rows or columns of the contingency table all at once. Second, the estimation of the occurrence probability is not simply in proportion to the observed frequency. The discounting scheme enables us to trade off (i) the loss of averaging probabilities in the agglomerated clusters, and (ii) the improvement of probability estimations by using larger samples sizes after agglomeration.</Paragraph>
      <Paragraph position="18"> It should be noted that although we have restricted our focus to one-to-one correspondences between terms and documents, the proposed framework can be directly applicable to more general cases with k([?] 2) attributes. Namely, given k random variables X</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Cluster Generation Procedure
</SectionTitle>
      <Paragraph position="0"> The cluster generation process is defined as the repeated iterations of cluster initiation and cluster improvement steps (Aizawa, 2002).</Paragraph>
      <Paragraph position="1"> First, in the cluster initiation step, a single term t i is selected, and an initial cluster is then formulated by collecting documents that contain t i and terms that co-occur with t i within the same document. The collected subsets, respectively, become S</Paragraph>
      <Paragraph position="3"> of the initiated cluster. On the bipartite graph of terms and documents (Figure 2), the process can be viewed as a two-step expansion starting from t</Paragraph>
      <Paragraph position="5"> Next, in the cluster improvement step, all the terms and documents in the initial cluster are tested for elimination in the order of increasing significance weights given by Eqs. (9) and (10). If the performance of the target cluster is improved after the elimination, then the corresponding term or document is removed. When finished with all the terms and documents in the cluster, the newly generated cluster is tested to see whether the evaluation value given by Eq.</Paragraph>
      <Paragraph position="6"> (8) is positive. Clusters that do not satisfy this condition are discarded. Note that the resulting cluster is only locally optimized, as the improvement depends on the order of examining terms and documents for elimination.</Paragraph>
      <Paragraph position="7"> At the initiation step, instead of randomly selecting an initiating term, our current implementation enumerates all the existing terms</Paragraph>
      <Paragraph position="9"> = 50 to avoid explosive computation caused by high frequency terms. Except for k max , the discounting coefficient d is the only parameter that controls the sizes of the generated clusters. The effect of d is examined in detail in the following experiments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>