<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2009">
  <Title>Cross-dataset Clustering: Revealing Corresponding Themes Across Multiple Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Algorithmic Framework
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Review of the IB Clustering Algorithm
</SectionTitle>
      <Paragraph position="0"> The information bottleneck (IB) iterative clustering method is a recent approach to soft (probabilistic) clustering for a single set, denoted by X, consisting of elements to be clustered (Tishby, Pereira &amp; Bialek, 1999). Each element x[?]X is identified by a probabilistic feature vector, with an entry, p(y|x), for every feature y from a pre-determined set of features Y. The p(y|x) values are estimated from given co- null p(y|x) = 1 for every x in X).</Paragraph>
      <Paragraph position="1"> The IB algorithm is derived from information theoretic considerations that we do not address here. It computes, through an iterative EM-like process, probabilistic assignments p(c|x) for each element x into each cluster c. Starting with random (or heuristically chosen) p(c|x) values at time t = 0, the IB algorithm iterates the following steps until convergence: IB1: Calculate for each cluster c its marginal  (Bayes' rule is used to compute p(x|c)).</Paragraph>
      <Paragraph position="2"> IB3: Calculate for each element x and each cluster c a value p(c|x), indicating the probability of assignment of x into c:</Paragraph>
      <Paragraph position="4"> Cover &amp; Thomas, 1991).</Paragraph>
      <Paragraph position="5"> The parameter b controls the sensitivity of the clustering procedure to differences between the p(y|c) values. The higher b is, the more determined the algorithm becomes in assigning each element into the closest cluster. As b is increased, more clusters that are separable from each other are obtained upon convergence (the target number of clusters is fixed). We want to ensure, however, that assignments do not follow more than necessary minute details of the given data, as a result of too high b (similarly to over generalization in supervised settings). The IB algorithm is therefore applied repeatedly in a cooling-like process: it starts with a low b value, corresponding to low temperature, which is increased every repetition of the whole iterative converging cycle, until the desired number of separate clusters is obtained.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Cross-dataset (CD) Clustering Method
</SectionTitle>
      <Paragraph position="0"> The (soft) CD clustering algorithm receives as input multiple datasets along with their feature vectors. In the current application, we have three sets extracted from the corresponding</Paragraph>
      <Paragraph position="2"> each of ~150 keywords to be clustered. A particular keyword might appear in two or more of the datasets, but the CD setting considers it as a distinct element within each dataset, thus keeping the sets of clustered elements disjoint. Like the IB clustering algorithm, the CD algorithm produces probabilistic assignments of the data elements.</Paragraph>
      <Paragraph position="3"> The feature set Y consists, in the current work, of about 7000 content words, each occurs in at least two of the examined corpora. The set of features is used commonly for all datasets, thus it underlies a common representation, which enables the clustering process to compare elements of different sets.</Paragraph>
      <Paragraph position="4"> Naively approached, the original IB algorithm could be utilized unaltered to the multipledataset setting, simply by applying it to the unified set X, consisting of the union of the</Paragraph>
      <Paragraph position="6"> 's. The problem of this simplistic approach is that each dataset has its own characteristic features and feature combinations, which correspond to prominent topics discussed uniquely in that corpus. A standard clustering method, such as the IB algorithm, would have a tendency to cluster together elements that originate in the same dataset, producing clusters populated mostly by elements from a single dataset (cf. Marx et al, 2002). The goal of CD clustering is to neutralize this tendency and to create clusters containing elements that share common features across different datasets.</Paragraph>
      <Paragraph position="7"> To accomplish this goal, we change the criterion by which elements are assigned into clusters.</Paragraph>
      <Paragraph position="8"> Recall that the assignment of an element x to a cluster c is determined by the similarity of their characterizing feature distributions, p(y|x) and p(y|c) (step IB3). The problem lies in using the p(y|c) distribution, which is determined by summing p(y|x) values over all cluster elements, to characterize a cluster without taking into account dataset boundaries. Thus, for a certain y, p(y|c) might be high despite of being characteristic only for cluster elements originating in a single dataset. This results in the tendency discussed above to favor clusters consisting of elements of a single dataset.</Paragraph>
      <Paragraph position="9"> Therefore, we define a biased probability distribution, p</Paragraph>
      <Paragraph position="11"> (y), to be used by the CD clustering algorithm for characterizing a cluster c. It is designed to call attention to y's that are typical for cluster members in all, or most, different datasets. Consequently, an element x would be assigned to a cluster c (as in step IB3) in accordance to the degree of similarity between its own characteristic features and those characterizing other cluster members from all datasets. The resulting clusters would thus contain representatives of all datasets.</Paragraph>
      <Paragraph position="12"> The definition of p  c (y) is based on the joint probability p(y,c,X i ). First, compute the geometric mean of p(y,c,X</Paragraph>
      <Paragraph position="14"> ) are calculated).</Paragraph>
      <Paragraph position="15"> r is not a probability measure, but just a function of y and c into [0,1]. However, since a geometric mean reflects uniformity of the averaged values, r captures the degree to which</Paragraph>
      <Paragraph position="17"> ) values are high across all datasets.</Paragraph>
      <Paragraph position="18"> We found empirically that at this stage, it is advantageous to normalize r across all clusters and then to rescale the resulting probabilities (over the c's, for each y) by the original p(y):</Paragraph>
      <Paragraph position="20"> Finally, to obtain a probability distribution over y for each cluster c, normalize the obtained</Paragraph>
      <Paragraph position="22"> to p(y|c) in IB), while ensuring that the feature-based similarity of c to any element x reflects feature distribution across all data sets.</Paragraph>
      <Paragraph position="23"> The CD clustering algorithm, starting at t = 0, iterates, in correspondence to the IB algorithm, the following steps: CD1: Calculate for each cluster c its marginal probability (same as IB1):</Paragraph>
      <Paragraph position="25"/>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="8" type="metho">
    <SectionTitle>
3 CD Clustering for Religion Comparison
</SectionTitle>
    <Paragraph position="0"> The three corpora that are focused on the compared religions Buddhism, Christianity and Islam have been downloaded from the Internet. Each corpus contains 20,00040,000 word tokens (510 Megabyte). We have used a text mining tool to extract most religion keywords that form the three sets to which we applied the CD algorithm. The software we have used TextAnalyst 2.0 identifies within the corpora key-phrases, from which we have excluded items that appear in fewer than three  cluster c and each religion, the 15 keywords x with the highest probability of assignment within the cluster are displayed (assignment probabilities, i.e. p(c|x) values are indicated in brackets). Terms that were used by the expert (see Table 2) are underlined.  god (.68), amida (.58), bodhisattva (.50), salvation (.45), enlightenment (.43), deva (.43), attain (.41), sacrifice (.39), awaken (.25), spirit (.25), nirvana (.24), buddha nature (.24), humanity (.22), speech (.18), teach (.18) god (.69), good works (.65), love of god (.62), salvation (.60), gift (.58), intercession (.56), repentance (.55), righteousness (.53), peace (.52), love (.51), obey god (.49), saviour (.48), atonement (.46), holy ghost (.45), jesus christ (.45) god (.86), one god (.84), allah (.76), bless (.76), worship (.75), submission (.73), peace (.73), command (.72), guide (.71), divinity (.70), messanger (.70), believe (.62), mankind (.61), commandment (.58), witness (.57) c</Paragraph>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
c_2 (Customs and Festivals)
</SectionTitle>
      <Paragraph position="0"> full moon (.99), stupa (.98), mantra (.96), pilgrim (.96), monastery (.89), temple (.86), statue (.73), worship (.61), monk (.54), mandala (.32), trained (.23), bhikkhu (.15), disciple (.12), meditation (.11), nun (.11) easter (.99), sunday (.99), christmas (.99), service (.98), city (.98), eucharist (.96), pilgrim (.95), pentecost (.93), jerusalem (.91), pray (.89), worship (.82), minister (.73), ministry (.70), read bible (.50), mass (.24) id al fitr (.99), friday (.99), ramadan (.99), eid (.99), pilgrim (.99), mosque (.99), mecca (.99), kaaba (.99), salat (.99), fasting (.99), medina (.98), city (.98), pray (.98), hijra (.97), charity (.96) c</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="7" type="sub_section">
      <SectionTitle>
c_3 (Spiritual States)
</SectionTitle>
      <Paragraph position="0"> phenomena (.94), problem (.93), mindfulness (.92), awareness (.92), consciousness (.91), law (.88), emptiness (.88), samadhi (.87), sense (.87), experience (.86), wisdom (.83), moral (.83), karma (.82), find (.81), exist (.80) moral (.96), problem (.94), argue (.91), question (.87), argument (.74), experience (.73), incarnation (.72), relationship (.71), idolatry (.58), find (.45), law (.41), learn (.38), confession (.34), foundation (.32), faith (.31) moral (.93), spirit (.79), question (.75), life (.71), freedom (.67), existence (.56), humanity (.53), find (.52), faith (.52), code (.51), law (.41), universe (.39), being (.36), teach (.35),</Paragraph>
      <Paragraph position="2"> (Sorrow, Sin, Punishment and Reward) lamentation (.99), grief (.99), animal (.89), pain (.87), death (.86), kill (.84), reincarnation (.81), realm (.76), samsara (.69), rebirth (.61), dukkha (.56), anger (.53), soul (.43), nirvana (.43), birth (.33) punish (.94), hell (.93), violence (.86), fish (.86), sin (.83), earth (.81), soul (.78), death (.77), sinner (.76), sinful (.74), heaven (.73), satan (.72), suffer (.71), flesh (.71), judgment (.67) hell (.97), earth (.88), heaven (.87), death (.85), sin (.85), alcohol (.69), satan (.60), face (.59), day of judgment (.52), deed (.48), angel (.25), being (.24), universe (.16), existence (.13), bearing (.12)</Paragraph>
      <Paragraph position="4"> (Schools, Traditions and their Originating Places) korea (.99), china (.99), tibet (.99), theravada (.99), school (.99), asia (.99), founded (.99), west (.99), sri lanka (.99), mahayana (.99), india (.99), history (.99), hindu (.99), japan (.99), study (.99) cardinal (.99), orthodox (.99), protestant (.99), university (.99), vatican (.99), catholic (.99), bishop (.99), rome (.99), pope (.99), monk (.99), tradition (.99), theology (.99), baptist (.98), church (.98), divinity (.93) africa (.99), shiite (.99), sunni (.99), shia (.99), west (.99), christianity (.99), arab (.99), founded (.98), arabia (.97), sufi (.96), history (.96), fiqh (.95), scholar (.91), imam (.90), jew (.89)</Paragraph>
      <Paragraph position="6"> gautama (.96), king (.95), friend (.68), disciple (.60), birth (.48), hear (.43), ascetic (.41), amida (.40), deva (.33), teach (.19), sacrifice (.15), statue (.14), buddha (.12), bodhisattva (.12), dharma (.09) bethlehem (.98), jordan (.97), mary (.95), lamb (.90), king (.90), second coming (.81), born (.76), israel (.74), child (.73), elijah (.72), baptize (.70), john the baptist (.68), priest (.68), adultery (.65), zion (.61) husband (.99), ismail (.98), father (.97), son (.95), mother (.94), born (.92), wife (.92), child (.89), ali (.88), musa (.71), isa (.70), ibrahim (.67), caliph (.43), tribe (.35), saint (.30)</Paragraph>
      <Paragraph position="8"> . Thus, composite and rare terms as well as phrases that the software has inappropriately segmented were filtered out. We have added to the automatically extracted terms additional items contributed by a comparative religion expert (about 15% of the sets were thus not extracted automatically, but those terms occur frequently enough to underlie informative co-occurrence vectors).</Paragraph>
      <Paragraph position="9"> The common set of features consists of all corpus words that occur in at least three different documents within two or three of the corpora, excluding a list of common function words. Co-occurrences were counted within a bi-directional five-word window, truncated by sentence ends.</Paragraph>
      <Paragraph position="10"> The number of clusters produced seven was empirically determined as the maximal number with relatively large proportion (p(c) &gt; .01) for all clusters. Trying eight clusters or more, we obtain clusters of minute size, which apparently do not reveal additional themes or topics. Table 1 presents, for each cluster c and each religion, the 15 keywords x with the highest p(c|x) values. The number 15 has no special meaning other than providing rich, balanced and displayable notion of all clusters. The displayed 315 keyword subsets are denoted c  .</Paragraph>
      <Paragraph position="11"> We gave each cluster a title, reflecting our (naive) impression of its content. As we interpret the clusters, they indeed reveal prominent aspects of religion: rituals (c  are reflected as well, in spite of the very different position taken by the distinct religions with regard to these issues.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
3.1 Comparison to Expert Data
</SectionTitle>
      <Paragraph position="0"> We have asked an expert of comparative religion studies to simulate roughly the CD clustering task: assigning (freely-chosen) keywords into corresponding subsets, reflecting prominent resembling aspects that cut across the three examined religions. The expert was not asked to indicate a probability of assignment, but he was allowed to use the same keyword in more than  An evaluation copy of TextAnalyst, by MicroSystems Ltd., can be downloaded from http://www.megaputer.com/php/eval.php3 one cluster. The expert clusters, with the exclusion of terms that we were not able to locate in our corpora, are displayed in Table 2. In addition to our tags e  the expert gave a title to each cluster.</Paragraph>
      <Paragraph position="1"> Although the keyword-clustering task is highly subjective, there are notable overlapped regions shared by the expert clusters and ours. Two expert topics Books (e  make sense but are not covered by the expert. To quantify the overall level of overlap between our output and the expert's, we introduce suitable versions of recall and precision measures.</Paragraph>
      <Paragraph position="2"> We want the recall measure to reflect the proportion of expert terms that are captured by our configuration, provided that an optimal correspondence between our clusters to the expert is considered. Hence, for each expert cluster, e</Paragraph>
      <Paragraph position="4"> number of overlapping terms (note that two or more expert clusters are allowed to be covered by a single c k , to reflect cases where several related sub-topics are merged within our results). Denote this maximal number by M(e</Paragraph>
      <Paragraph position="6"> Consequently, the recall measure R is defined to be the sum of the above maximal overlap counts over all expert clusters, divided by all 131 expert terms (repetitions in distinct clusters counted):</Paragraph>
      <Paragraph position="8"> To estimate how precise our results are, we are interested in the relative part of our clusters, reduced to the expert terms, which has been assigned to the right expert cluster by the same optimal correspondence. Note that in this case we do not want to sum up several M values that are associated with a single c k : a single cluster covering several expert clusters should be considered as an indication of poor precision. Furthermore, if we do this, we might recount some of c k 's terms (specifically, keywords that the expert has included in several clusters; this might result in precision &gt; 100%). We need therefore to consider at most one M value per c</Paragraph>
      <Paragraph position="10"> empty, i.e. there is no e  values, divided by the number of expert terms appearing among the c</Paragraph>
      <Paragraph position="12"> (repetitions counted), which are, in the current case, the 94 underlined terms in Table 1:</Paragraph>
      <Paragraph position="14"> Our algorithm has achieved the following  0.33). As we have expected, three of the clusters produced by the IB algorithms are populated, with very high probability, by most keywords of a single religion. Within these specific religion clusters as well as the other sparsely populated clusters, the ranking inducted by the p(c|x) values is not very informative regarding particular sub-topics. Thus, the IB performs the CD clustering task poorly, even in comparison to random results. We note that, similarly to our algorithm, the IB algorithm produces at most 7 clusters of non- null cluster, the best fitting automated cross-dataset cluster is indicated on the right-hand side, as well as the number of relevant expert words it includes. The terms of this best-fit cluster are underlined. Superscripts indicate indices of the cross-dataset cluster(s), among c  impression that the limit on number of interesting clusters reflects intrinsic exhaustion of the information embodied within the given data. It is yet to be carefully examined whether this observation provides any hint regarding the general issue of the right number of clusters.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="8" end_page="8" type="metho">
    <SectionTitle>
4 Conclusion
</SectionTitle>
    <Paragraph position="0"> This paper addressed the relatively unexplored problem of detecting corresponding themes across multiple corpora. We have developed an extended clustering algorithm that is based on the appealing and highly general Information Bottleneck method. Substantial effort has been devoted to adopting this method for the Cross-Dataset clustering task.</Paragraph>
    <Paragraph position="1"> Our approach was demonstrated empirically on the challenging task of finding corresponding themes across different religions. Subjective examination of the system's output, as well as its comparison to the output of a human expert, demonstrate the potential benefits of applying this approach in the framework of comparative research, and possibly in additional text mining applications.</Paragraph>
    <Paragraph position="2"> Given the early stage of this line of research, there is plenty of room for future work. In particular, further research is needed to provide theoretic grounding for the CD clustering formulations and to specify their properties.</Paragraph>
    <Paragraph position="3"> Empirical work is needed to explore the potential of the proposed paradigm for other textual domains as well as for related applications. Particularly, we have recently presented a similar framework for template induction in information extraction (crosscomponent clustering, Marx, Dagan, &amp; Shamir, 2002), which should be studied in relation to the CD algorithm presented here.</Paragraph>
    <Paragraph position="4"> Appendix The value of p(X i ), which is required for the calculations in Section 3.2, is given directly from the input co-occurrence data as follows:</Paragraph>
    <Paragraph position="6"> ) are calculated from values that are available at time step t[?]1:</Paragraph>
    <Paragraph position="8"/>
  </Section>
</Paper>