<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1123">
  <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 979-986, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. A Generalized Framework for Revealing Analogous Themes across Related Topics</Title>
  <Section position="3" start_page="979" end_page="979" type="metho">
    <SectionTitle>
2 The cross-partition clustering problem
</SectionTitle>
    <Paragraph position="0"> The cross-partition clustering problem is an extension of the standard (single-set) data clustering problem. In the cross-partition setting, the dataset is pre-partitioned into several distinct subsets of elements to be clustered. For example, in our experiments each of these subsets consisted of topical key terms to be clustered. Each such subset was extracted automatically from a sub-corpus corresponding to a different religion (see Section 5).</Paragraph>
    <Paragraph position="1"> As in the standard clustering problem, our goal is to cluster the data such that each term cluster would capture a particular theme in the data.</Paragraph>
    <Paragraph position="2"> However, the generated clusters are expected to identify themes that cut across all the given subsets. For example, one cluster consists of names of festivals of different religions, such as Easter, Christmas, Sunday (Christianity), Ramadan, Friday, Id-al-fitr (Islam), and Sukoth, Shavuot, Passover (Judaism; see Figure 4 for more examples).</Paragraph>
  </Section>
  <Section position="4" start_page="979" end_page="980" type="metho">
    <SectionTitle>
3 Distributional clustering
</SectionTitle>
    <Paragraph position="0"> Our algorithmic framework elaborates on Pereira et al.'s (1993) distributional clustering method.</Paragraph>
    <Paragraph position="1"> Distributional clustering probabilistically clusters data elements according to the distribution of a given set of features associated with the data. Each data element x is represented as a probability distribution p(y|x) over all features y. In our data, p(y|x) is the empirical co-occurrence frequency of a feature word y with a key term x, normalized over all feature word co-occurrences with x.</Paragraph>
    <Paragraph position="2"> The distributional clustering algorithmic scheme (Figure 1) is a probabilistic (soft) version of the well-known K-means algorithm. It iteratively alternates between: (1) Calculating assignments to clusters: calculate an assignment probability p(c|x) for each data element x into each one of the clusters c. This soft assignment decreases exponentially with an information-theoretic distance (the KL divergence) between the element's p(y|x) representation and the centroid of c, represented by a distribution p(y|c). The marginal cluster probability p(c) may optionally be set as a prior in this calculation, as in Tishby et al. (1999); in Figure 1 we mark it with a dotted underline to denote that it is optional.</Paragraph>
    <Paragraph position="3"> [Figure 1: the distributional clustering algorithm (with a fixed β value and a fixed number of clusters) - set t = 0, initialize the assignments randomly, and repeatedly iterate the two update steps below until convergence.]</Paragraph>
    <Paragraph position="7"> (2) Calculating cluster centroids: calculate a probability distribution p(y|c) over all features y given each cluster c, based on the feature distributions of the cluster's elements, weighted by the p(c|x) assignment probabilities calculated in step (1) above.</Paragraph>
    <Paragraph position="8"> This step imposes the independence of the features y from the clusters c given the data x (similarly to the naive Bayes assumption in the supervised setting).</Paragraph>
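The two alternating update steps can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming conventions, not the authors' code; the optional p(c) prior of the IB variant is placed behind a flag:

```python
import numpy as np

def distributional_clustering(p_y_given_x, p_x, n_clusters, beta,
                              n_iter=100, use_prior=False, seed=0):
    """Soft (probabilistic) K-means over distributions, sketching Figure 1.
    `p_y_given_x` is an (n_elements, n_features) row-stochastic matrix;
    `beta` is the inverse-temperature parameter."""
    rng = np.random.default_rng(seed)
    n_x = p_y_given_x.shape[0]
    # random initial soft assignments p(c|x), one row per element
    p_c_given_x = rng.random((n_x, n_clusters))
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        # step (2): centroids p(y|c), weighting elements by p(c|x) p(x)
        w = p_c_given_x * p_x[:, None]           # joint p(c, x)
        p_c = w.sum(axis=0)                      # marginal p(c)
        p_y_given_c = (w.T @ p_y_given_x) / (p_c[:, None] + eps)
        # step (1): KL divergence of each element to each centroid
        kl = np.array([[np.sum(px * np.log((px + eps) / (pc + eps)))
                        for pc in p_y_given_c] for px in p_y_given_x])
        logits = -beta * kl
        if use_prior:                            # optional IB-style prior
            logits += np.log(p_c + eps)
        logits -= logits.max(axis=1, keepdims=True)
        p_c_given_x = np.exp(logits)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)
    return p_c_given_x, p_y_given_c
```

Running the function with a growing beta over successive calls mimics the annealing-like regime the section describes: assignments sharpen and clusters become more separable.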
    <Paragraph position="9"> Subsequent works (Tishby et al., 1999; Gedeon et al., 2003) have studied and motivated further the earlier distributional clustering method. In particular, it can be shown that the algorithm of Figure 1 converges to a local minimum of the cost term β H(Y|C) - H(C|X), where H denotes entropy and X, Y and C are formal variables whose values range over all data elements, features and clusters, respectively.</Paragraph>
    <Paragraph position="12"> Tishby et al.'s (1999) information bottleneck method (IB) includes the marginal cluster entropy H(C) in step (1) of the algorithm.</Paragraph>
    <Paragraph position="13"> The parameter β that appears in the cost term and in step (1) of the algorithm can have any positive real value. It counterbalances the relative impact of maximizing the feature information conveyed by the partition into clusters, i.e. minimizing H(Y|C), versus applying the maximum entropy principle to the cluster assignment probabilities (see Gedeon et al., 2004), i.e., maximizing H(C|X). The higher β is, the more &amp;quot;determined&amp;quot; the algorithm becomes in assigning each element to the most appropriate cluster. In subsequent runs of the algorithm β can be increased, yielding more separable clusters (clusters with noticeably different centroids) upon convergence.</Paragraph>
    <Paragraph position="14"> The runs can repeat until, for some β, the desired number of separate clusters is obtained.</Paragraph>
  </Section>
  <Section position="5" start_page="980" end_page="982" type="metho">
    <SectionTitle>
4 The cross-partition clustering method
</SectionTitle>
    <Paragraph position="0"> In the cross-partition framework, the pre-partition of the data into subsets is captured through an additional formal variable W, whose values range over the subsets. In our data, each religion corresponds to a different W value, w. Each religion-related key term x is associated with one religion w, with p(w|x) = 1 (and p(w'|x) = 0 for any w' ≠ w). Formally, our framework allows a probabilistic pre-partition, i.e., p(w|x) values between 0 and 1, but this option was not examined empirically.</Paragraph>
    <Paragraph position="1"> The Cross-Partition (CP) clustering method (Figure 2) is an extended version of the probabilistic K-means scheme, introducing additional steps in the iterative loop that incorporate the added pre-partition variable W: (1) Calculating assignments to clusters, i.e. probabilistic p(c|x) values, is based on current values of cluster centroids, as in distributional clustering. (2) Calculating subset-projected cluster centroids.</Paragraph>
    <Paragraph position="2"> Given the current element assignments, centroids are computed separately for each combination of a cluster c projected on a pre-given subset w. Each such subset-projected centroid is given by a probability distribution p(y|c,w) over the features y, for each c and w separately (instead of p(y|c)). [Figure 2: the cross-partition clustering algorithm (with fixed β and α values and a fixed number of clusters) - set t = 0 and repeatedly iterate the update-step sequence below until convergence; in the first iteration, randomly or arbitrarily initialize the assignments. The terms marked by dotted underline are optional.] (3) Re-evaluating cluster-feature association.</Paragraph>
    <Paragraph position="5"> Based on the subset-projected centroids, the associations between features and clusters are re-evaluated: features that are commonly prominent across all subsets are promoted relative to features with varying prominence. A weighted geometric mean scheme achieves this effect: the value of ∏_w p(y|c,w)^p(w) is larger as the different p(y|c,w) values are distributed more uniformly over the different w's, for any given c and y. α is a positive-valued free parameter, which controls the impact of uniformity versus variability of the averaged values. The re-evaluated associations resulting from this stage are probability distributions over the clusters, denoted p*(c|y). We add an asterisk to distinguish this conditional probability distribution from other p(c|y) values that can be calculated directly from the output of the previous steps.</Paragraph>
    <Paragraph position="6"> (4) Calculating cross-partition &amp;quot;global&amp;quot; centroids: based on the probability distributions p*(c|y), we calculate a probability distribution p*(y|c) for every cluster c through a straightforward application of Bayes' rule, obtaining the cross-partition cluster centroids.</Paragraph>
    <Paragraph position="7"> The novelty of the CP algorithm lies in step (3): rather than deriving cluster centroids directly, as in the standard k-means scheme, cluster-feature associations are biased by their prominence across the cluster projections over the different subsets. This way, only features that are prominent in the cluster across most subsets end up prominent in the eventual cluster centroid (computed in step 4). By incorporating for every c-y pair a product over all w's, independence of the feature-cluster associations from specific w values is ensured. This conforms to our target of capturing themes that cut across the pre-given partition and are not correlated with specific subsets.</Paragraph>
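As a rough sketch, one pass of the four CP update steps might look as follows in NumPy. The function name, the initialization, and in particular the exact placement of the α exponent in the geometric-mean step (3) are our own assumptions for illustration, not the paper's exact specification:

```python
import numpy as np

def cp_iteration(p_c_given_x, p_y_given_x, p_x, w_of_x, n_subsets,
                 beta, alpha):
    """One loop over the four cross-partition update steps (a sketch)."""
    eps = 1e-12
    n_c = p_c_given_x.shape[1]
    p_w = np.array([p_x[w_of_x == w].sum() for w in range(n_subsets)])
    # step (2): subset-projected centroids p(y|c,w)
    p_y_cw = np.zeros((n_c, n_subsets, p_y_given_x.shape[1]))
    for w in range(n_subsets):
        mask = (w_of_x == w)
        weights = (p_c_given_x[mask] * p_x[mask, None]).T   # (c, x in w)
        p_y_cw[:, w, :] = weights @ p_y_given_x[mask]
        p_y_cw[:, w, :] /= p_y_cw[:, w, :].sum(axis=1, keepdims=True) + eps
    # step (3): weighted geometric mean across subsets, promoting features
    # that are prominent in a cluster across all subsets uniformly
    geo = np.exp(np.tensordot(np.log(p_y_cw + eps), p_w, axes=([1], [0])))
    scores = geo ** alpha                      # (c, y); alpha's role assumed
    p_c_given_y = (scores / (scores.sum(axis=0, keepdims=True) + eps)).T
    # step (4): Bayes' rule back to global centroids p*(y|c)
    p_y = p_x @ p_y_given_x                    # marginal over features
    joint = p_c_given_y * p_y[:, None]         # (y, c)
    p_y_given_c = (joint / (joint.sum(axis=0, keepdims=True) + eps)).T
    # step (1) of the next iteration: soft assignments via KL divergence
    kl = np.array([[np.sum(px * np.log((px + eps) / (pc + eps)))
                    for pc in p_y_given_c] for px in p_y_given_x])
    logits = -beta * kl
    logits -= logits.max(axis=1, keepdims=True)
    new = np.exp(logits)
    return new / new.sum(axis=1, keepdims=True), p_y_given_c
```

Iterating this function until the assignments stop changing corresponds to the equilibrium-seeking dynamics described below; the geometric mean in step (3) is what suppresses cluster-feature associations that hold in only one subset.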
    <Paragraph position="8"> Employing a separate update step to accomplish this re-evaluation implies a deviation from the familiar cost-based scheme. Indeed, the CP method is not directed by a single cost function that globally quantifies the cross-partition clustering task as a whole. Rather, there are four different &amp;quot;local&amp;quot; cost terms, each articulating a different aspect of the task. As shown in the appendix, each of the update steps (1)-(4) reduces one of these four cost terms, under the assumption that the values not modified by that step are held constant. This assumption of course does not hold, as values that are not modified by a given step are modified by another. Hence, downward convergence (of any of the cost terms) is not guaranteed. However, empirical experimentation shows that the dynamics of the CP algorithm tend to stabilize on an equilibrium-like steady state, where the four different distributions produced by the algorithm balance each other, as illustrated in Figure 3. In fact, convergence occurred in all our text-based experiments (as well as in experiments with synthetic data; Marx et al., 2004).</Paragraph>
    <Paragraph position="9"> Manipulating the value of the β parameter works in practice for the CP method as it does for distributional clustering: increasing β along subsequent runs enables the formation of configurations of growing numbers of clusters. The CP framework introduces an additional parameter, α, which affects step (3). As said, the geometric mean scheme promotes those c-y associations for which the p(y|c,w) values are distributed evenly across the w's (for any fixed c and y). A low α implies a relatively low penalty for those c-y combinations that are not distributed evenly across the w's, but it also entails a greater loss of information compared to a high α. We experimented with α values that were fixed during a whole sequence of runs, while only β was gradually incremented (see Section 5). [Figure 3: the dynamics of the CP framework versus that of distributional clustering. In distributional clustering, convergence is onto a configuration where the two systems of distributions complementarily balance one another, bringing a cost term to a locally minimal value. In CP, stable configurations maintain balanced inter-dependencies (equilibrium) of four systems of probability distributions.]</Paragraph>
    <Paragraph position="12"> As with the optional incorporation of priors in the distributional clustering scheme (Figure 1), the CP framework detailed in Figure 2 encapsulates four different algorithmic variants: the prior terms (marked in Figure 2 with dotted underline) can optionally be added in steps (1) and/or (3) of the algorithm. As in the distributional clustering case, the inclusion of these terms corresponds to the inclusion of cluster entropy in the corresponding cost terms (see Appendix). It is interesting to note that we have previously introduced, on intuitive accounts, some of these variants separately. Here we term the three variations involving priors CPI (prior added in step (1) only, which is the same as the method described in Dagan et al., 2002), CPII (prior added in step (3) only) and CPIII (prior added in both steps, as the method in Marx et al., 2004). The version with no priors is denoted CP. Our formulation reveals that these are all special cases of the general CP framework described above.</Paragraph>
  </Section>
  <Section position="6" start_page="982" end_page="984" type="metho">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"> The data elements that we used for our experiments - religion-related key terms - were automatically extracted from a pre-divided corpus addressing five religions: Buddhism, Christianity, Hinduism, Islam and Judaism. The clustered key-term set was pre-partitioned, correspondingly, into five disjoint subsets, one per religion w.3 In our experimental setting, the key term subsets for the different religions were considered disjoint, i.e., occurrences of the same word in different subsets were considered distinct elements. The set of features y consisted of words that co-occurred with key terms within a 5-word window, truncated by sentence boundaries. About features, each occurring in all five sub-corpora, were selected.</Paragraph>
    <Paragraph position="1"> We survey below some results, which were produced by the plain (unpriored) CP algorithm applied to all five religions together.</Paragraph>
    <Paragraph position="2"> First, we describe our findings qualitatively and afterwards we provide quantitative evaluation.</Paragraph>
    <Paragraph position="3"> 3 We use the dataset of Marx et al. (2004) - five sub-corpora, of roughly one million words each, consisting of introductory web pages, electronic journal papers and encyclopedic entries about the five religions; about key terms were extracted from each sub-corpus to form the clustered subsets.</Paragraph>
    <Section position="1" start_page="982" end_page="983" type="sub_section">
      <SectionTitle>
5.1 Cross-religion Themes
</SectionTitle>
      <Paragraph position="0"> We have found that even the coarsest partition of the data to two clusters was informative and illuminating. It revealed two major aspects that seem to be equally fundamental in the religion domain.</Paragraph>
      <Paragraph position="1"> We termed them the &amp;quot;spiritual aspect&amp;quot; and the &amp;quot;establishment aspect&amp;quot; of Religion. The &amp;quot;spiritual&amp;quot; cluster incorporated terms related to theology, underlying concepts and personal religious experience. Many of the terms assigned to this cluster with highest probability, such as heaven, hell, soul, god and existence, were in common use across several religions, but it also included religion-specific words such as atman, liberation and rebirth (key concepts of Hinduism). The &amp;quot;establishment&amp;quot; cluster contained names of schools, sects, clerical positions and other terms connected to religious institutions, geo-political entities and so on. Terms assigned to this cluster with high probability were mainly religion-specific: protestant, vatican, university, council in Christianity; conservative, reconstructionism, sephardim, ashkenazim in Judaism; and so on (a few terms, though, were common to several religions, for instance east and west). This two-theme partition was obtained persistently (also when the CP method was applied to pairs of religions rather than to all five). Hence, these aspects appear to be the two universal constituents of religion-related texts in general, to the extent that the data faithfully reflect this domain.</Paragraph>
      <Paragraph position="2"> Clusters of finer granularity still seem to capture fundamental, though more focused, themes. For example, the partition into seven clusters revealed the following topics (our titles): &amp;quot;schools&amp;quot;, &amp;quot;divinity&amp;quot;, &amp;quot;religious experience&amp;quot;, &amp;quot;writings&amp;quot;, &amp;quot;festivals and rite&amp;quot;, &amp;quot;material existence, sin, and suffering&amp;quot; and &amp;quot;family and education&amp;quot;. Figure 4 details the members of highest p(c|x) values within each religion in each of the seven clusters.</Paragraph>
      <Paragraph position="3"> The relation of the seven clusters to the coarser two-cluster configuration can be described in soft-hierarchy terms: the &amp;quot;schools&amp;quot; cluster and, to a somewhat lesser extent, the &amp;quot;festivals&amp;quot; and &amp;quot;family&amp;quot; clusters, are related to the &amp;quot;establishment aspect&amp;quot; reflected in the two-cluster partition, while &amp;quot;divinity&amp;quot;, &amp;quot;religious experience&amp;quot; and &amp;quot;suffering&amp;quot; are clearly associated with the &amp;quot;spiritual aspect&amp;quot;. The remaining topic, &amp;quot;writings&amp;quot;, is equally associated with both. The probabilistic framework enabled the CP method to cope with these composite relationships between the coarse partition and the finer one.</Paragraph>
      <Paragraph position="4"> [Figure 4: the seven-cluster configuration of the religion data, including the first members - up to nine - of highest p(c|x) within each religion in each cluster; cluster titles were assigned by the authors for reference. CLUSTER 1 &amp;quot;Schools&amp;quot; Buddhism: america asia japan west east korea india china tibet Christianity: orthodox protestant catholic west orthodoxy organization rome council america Hinduism: west christian religious civilization buddhism aryan social founder shaiva Islam: africa asia west east sunni shiah christian country civilization philosophy Judaism: reform conservative reconstructionism zionism orthodox america europe sephardim ashkenazim CLUSTER 2 &amp;quot;Divinity&amp;quot; Buddhism: god brahma Christianity: holy-spirit jesus-christ god father savior jesus baptize salvation reign Hinduism: god brahma Islam: god allah peace messenger jesus worship believing tawhid command Judaism: god hashem bless commandment abraham CLUSTER 3 &amp;quot;Religious Experience&amp;quot; Buddhism: phenomenon perception consciousness human concentration mindfulness physical liberation Christianity: moral human humanity spiritual relationship experience expression incarnation divinity Hinduism: consciousness atman human existence liberation jnana purity sense moksha Islam: spiritual human physical moral consciousness humanity exist justice life Judaism: spiritual human existence physical expression humanity experience moral connect CLUSTER 4 &amp;quot;Writings&amp;quot; Buddhism: pali-canon sanskrit sutra pitaka english translate chapter abhidhamma book Christianity: chapter hebrew translate greek newtestament book text old-testament luke Hinduism: rigveda gita sanskrit upanishad sutra smriti brahma-sutra scripture mahabharata Islam: chapter surah bible write translate hadith book language scripture Judaism: tanakh scripture mishnah book oral talmud bible write letter CLUSTER 5 &amp;quot;Festivals and Rite&amp;quot; Buddhism: full-moon celebration stupa ceremony sakya abbot ajahn robe retreat Christianity: easter tabernacle christmas sunday sabbath jerusalem pentecost city season Hinduism: puja ganesh festival ceremony durga rama pilgrimage rite temple Islam: kaabah id ramadan friday id-al-fitr haj mecah mosque salah Judaism: sukoth festival shavuot temple passover jerusalem rosh-hashanah temple-mount rosh-hodesh CLUSTER 6 &amp;quot;Sin, Suffering, Material Existence&amp;quot; Buddhism: lamentation water grief kill eat hell animal death heaven Christianity: fire punishment eat water animal lost hell perish lamb Hinduism: animal heaven earth death water kill demon birth sun Islam: water animal hell punishment paradise food pain sin earth Judaism: animal water eat kosher sin heaven death food forbid CLUSTER 7 &amp;quot;Family and Education&amp;quot; Buddhism: child friend son people family question learn hear teacher Christianity: friend family mother boy question woman problem learn child Hinduism: child question son mother family learn people teacher teach Islam: sister husband wife child family marriage mother woman brother Judaism: child marriage wife mother father women question family people]</Paragraph>
      <Paragraph position="5"> It is interesting to have a notion of those features y with high p*(c|y) within each cluster c. We exemplify those typical features, for each one of the seven clusters, through four of the highest p*(c|y) features (excluding those terms that function as both features and clustered terms): &amp;quot;schools&amp;quot; cluster: central, dominant, mainstream, affiliate; &amp;quot;divinity&amp;quot; cluster: omnipotent, almighty, mercy, infinite; &amp;quot;religious experience&amp;quot; cluster: intrinsic, mental, realm, mature; &amp;quot;writings&amp;quot; cluster: commentary, manuscript, dictionary, grammar; &amp;quot;festivals and rite&amp;quot; cluster: annual, funeral, rebuild, feast; &amp;quot;material existence, sin, and suffering&amp;quot; cluster: vegetable, insect, penalty, quench; &amp;quot;family and education&amp;quot; cluster: parent, nursing, spouse, elderly.</Paragraph>
      <Paragraph position="6"> We focus demonstratively on the two-cluster and seven-cluster configurations, as these numbers are small enough to allow a review of all clusters. Configurations of more clusters revealed additional subtopics, such as education, prayer and so on.</Paragraph>
      <Paragraph position="7"> There are some prominent points of correspondence between our findings and Ninian Smart's comparative religion classic Dimensions of the Sacred (1996). For instance, Smart's ritual dimension corresponds to our &amp;quot;festivals and rite&amp;quot; cluster, and his experiential and emotional dimension corresponds to our &amp;quot;religious experience&amp;quot; cluster.</Paragraph>
    </Section>
    <Section position="2" start_page="983" end_page="984" type="sub_section">
      <SectionTitle>
5.2 Evaluation with Expert Data
</SectionTitle>
      <Paragraph position="0"> We evaluated the performance of our method against cross-religion key term clusters constructed manually by a team of three experts in comparative religion studies. Each manually produced clustering configuration referred to two of the five religions, rather than to all five jointly as in our qualitative review. We examined eight of the ten religion pairs that can be chosen from the total of five. Each religion pair was addressed independently by two different experts using the same set of key terms (so the total number of contributed configurations was 16). Thus, we could also assess the level of agreement between experts.</Paragraph>
      <Paragraph position="1"> As an overlap measure we employed the Jaccard coefficient, which is the ratio n11 / (n11 + n10 + n01), where: n11 is the number of term pairs assigned to the same cluster by both our method and the expert; n10 is the number of term pairs co-assigned by our method but not by the expert; n01 is the number of term pairs co-assigned by the expert but not by our method.</Paragraph>
      <Paragraph position="2"> As the Jaccard score relies on counts of individual term pairs, no assumption with regard to the suitable number of clusters is required. Hence, for each religion pair we produced with our method configurations of two to 16 clusters and calculated, for each configuration, Jaccard scores based on the overlap with the relevant expert configurations. The scores obtained were averaged over the 15 configurations. The means, over all 16 experimental cases, of those average values are displayed in Table 1.</Paragraph>
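A direct pairwise implementation of this overlap measure can be written as follows (an illustrative sketch; cluster labels are assumed to be hard assignments, e.g. each term mapped to its highest-probability cluster):

```python
from itertools import combinations

def pairwise_jaccard(labels_a, labels_b):
    """Jaccard overlap between two clusterings of the same items:
    n11 / (n11 + n10 + n01), counting item pairs co-clustered in both
    configurations, in the first only, and in the second only."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    denom = n11 + n10 + n01
    return n11 / denom if denom else 1.0
```

Because the measure compares pairs rather than clusters, the two configurations may have different numbers of clusters, which is what allows averaging over the 2-to-16-cluster configurations described above.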
      <Paragraph position="3"> We tested all four CP method variants, with different fixed values of the α parameter. In addition, we evaluated results obtained by the priored version of distributional clustering (the IB method; Tishby et al., 1999; see Figure 1). Marx et al.</Paragraph>
      <Paragraph position="4"> (2004) mentioned Information Bottleneck with Side Information (IB-SI; Chechik &amp; Tishby, 2003) as a method capable - unlike standard distributional clustering - of capturing information regarding the pre-partition to subsets, which makes it a seemingly sensible alternative to the CP method. Therefore, we tested the IB-SI method as well, following the adaptation scheme to the CP setting described by Marx et al., with a fixed value of its parameter (with higher values, convergence did not take place in all experiments). As Table 1 shows, the different CP variants performed better than the alternatives. The CPIII variant, with both prior types, was less robust to changes in the α value and seemed to be more sensitive to noise.</Paragraph>
      <Paragraph position="5"> The experimental part of this work demonstrates that the task of drawing thematic correspondences is challenging. In the particular domain that we have examined the level of agreement between experts seems to make it evident that the task is inherently subjective and just partly consensual. It  ods, examined over of the 16 religion-pair evaluation cases (incorporating mean Jaccard scores over 2-16 clustering configurations, see text). The differences between most CP variants and cross-expert agreement are not statistically significant. The differences between IB, IB-SI and CPIII with = 0.83 and expert agreement are significant (two-tailed t-test, df = 15, p &lt; ).</Paragraph>
      <Paragraph position="6">  Agreement between the experts: 0.462 is remarkable therefore that most variations of our method approximate rather closely the upper bound of the level of agreement between the experts. Further, we have shown the merit of promoting shared cross-subset patterns and neutralizing topic-specific regularities in a newly introduced dedicated computational step. Methods that do not consider this direction (IB) or that incorporate it within a more conventional cost based search (IB-SI) yield notably poorer performance.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="984" end_page="985" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> In this paper, we studied and demonstrated the cross partition method, a computational framework that addresses the task of identifying analogies and correspondences in texts. Our approach to this problem bridges between cognitive observations regarding analogy making, which have inspired it, and unsupervised learning techniques.</Paragraph>
    <Paragraph position="1"> While previous cognitively-motivated computational frameworks required structured input (e.g. Falkenhainer et al., 1989), the CP method adapts distributional clustering (Pereira et al., 1993), a standard approach applicable to unstructured data.</Paragraph>
    <Paragraph position="2"> Unlike standard clustering, the CP method considers an additional source of information: a pre-partition of the clustered data into several topical subsets (originating in different sub-corpora) between which a correspondence is drawn.</Paragraph>
    <Paragraph position="3"> The innovative aspect of the cross-partition method lies in distinguishing feature information that cuts across the given pre-partition to subsets from subset-specific information. In order to incorporate this aspect within distributional clustering, the CP method interleaves several update steps, each locally optimizing a different cost term. Our experiments demonstrate that the CP method is capable of revealing interesting and non-trivial corresponding themes in texts. The results obtained with most variants of the CP method, with suitable tuning of the parameters, outperform comparable methods - standard distributional clustering and the IB-SI method - and are rather close to the level of agreement between experts.</Paragraph>
    <Paragraph position="4"> The CP method revealed, at various resolution levels, meaningful themes that to our understanding can be considered prominent constituents of Religion. It would be an interesting challenge to apply the CP framework to other tasks, possibly of a more practical flavor, such as comparing and detecting commonalities between commercial products and firms, identifying equivalencies and precedents in legal cases, and so on.</Paragraph>
  </Section>
</Paper>