Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 979–986, Vancouver, October 2005. ©2005 Association for Computational Linguistics
A Generalized Framework for Revealing Analogous 
Themes across Related Topics 
 
 
Zvika Marx, CS and AI Laboratory, MIT, Cambridge, MA 02139, US (zvim@csail.mit.edu)
Ido Dagan, Computer Science Department, Bar-Ilan University, Ramat-Gan 52900, Israel (dagan@cs.biu.ac.il)
Eli Shamir, School of Computer Science, The Hebrew University, Jerusalem 91904, Israel (shamir@cs.huji.ac.il)
 
 
 
 
Abstract 
This work addresses the task of identifying thematic correspondences across sub-corpora focused on different topics.  We introduce an unsupervised algorithmic framework based on distributional data clustering, which generalizes previous initial works on this task.  The empirical results reveal interesting commonalities of different religions.  We evaluate the results by measuring the overlap of our clusters with clusters compiled manually by experts.  The tested variants of our framework are shown to outperform alternative methods applicable to the task.
1 Introduction 
The ability to identify analogies and correspon-
dences is one of the fascinating aspects of intelli-
gence.  Research in cognitive science has 
acknowledged the significance of this ability of 
human thinking, particularly in learning across dif-
ferent situations or domains where the common 
base to learning is not straightforward.  Several 
previous computational models of analogy making 
(e.g. Falkenhainer et al., 1989) suggested symbolic 
computational mechanisms for constructing de-
tailed mappings that connect corresponding ingre-
dients across analogized systems. 
This work explores the identification of thematic 
correspondences in texts through an extension of 
the well known data clustering problem.  Previous 
works aimed at identifying – through clusters of 
words – concepts, sub-topics or themes that are 
prominent within a corpus of texts (e.g., Pereira et 
al., 1993; Li, 2002; Lin and Pantel, 2002).  The 
current work deals with extending this line of re-
search to identify corresponding themes across a 
corpus pre-divided to several sub-corpora, which 
are focused on different, yet related, topics. 
This research task has been defined quite re-
cently (Dagan et al., 2002), and has not been ex-
plored extensively yet.  One could think, however, 
of many potential applications for drawing corre-
spondences across textual resources: comparison 
of related firms or products, identifying equivalen-
cies in news published in different countries, and 
so on.  The experimental part of our work deals 
with revealing correspondences between different 
religions: Buddhism, Christianity, Hinduism, Islam 
and Judaism.  Given a pre-partition of the corpus to 
sub-corpora, one for each religion, our method ex-
poses common aspects for all religions, such as 
sacred writings, festivals and suffering. 
The mechanism we employ directs correspond-
ing key terms in the different sub-corpora, such as 
names of festivals of different religions, to be in-
cluded in the same cluster.  Term clustering meth-
ods in general, and in this work in particular, rely 
on word co-occurrence statistics: terms sharing 
similar words co-occurrence statistics are clustered 
together.  Different topics, however, are character-
ized by distinctive terminology and typical word 
co-locations.  Therefore, given a pre-divided cor-
pus, similar co-occurrence patterns would typically 
be extracted from the same topical sub-corpus.  
When the terminology and typical phrases employed by each topic differ greatly (even if the topics are essentially related, e.g. different religions), the tendency to form topic-specific clusters intensifies regardless of factors that could otherwise have impacted this tendency, such as the co-occurrence window size.  Consequently, corresponding key
terms of different topics may not be assigned by a 
standard method to the same cluster, in contrast to 
our goal.  The method described in this paper aims 
precisely at this problem: it is designed to neutral-
ize salient co-occurrence patterns within each topi-
cal sub-corpus and to promote less salient patterns 
that are shared across the sub-corpora. 
In an earlier line of research we have formulated 
the above problem and addressed it within a prob-
abilistic vector-based setting, presenting two re-
lated heuristic algorithms (Dagan et al., 2002; 
Marx et al., 2004).  Here, we devise a general prin-
cipled distributional clustering paradigm for this 
problem, termed cross-partition clustering, and 
show that the earlier algorithms are special cases of 
the new framework.   
This paper proceeds as follows: Section 2 de-
scribes in more detail the cross-partition clustering 
problem. Section 3 reviews distributional data 
clustering methods, which form the basis to our 
algorithmic framework described in Section 4.  
Section 5 presents experimental results that reveal 
interesting themes common to different religions 
and demonstrates, through an evaluation based on 
human expert data, that the different variants of 
our framework outperform alternative methods. 
2 The cross-partition clustering problem 
The cross-partition clustering problem is an exten-
sion of the standard (single-set) data clustering 
problem.  In the cross-partition setting, the dataset 
is pre-partitioned into several distinct subsets of 
elements to be clustered.  For example, in our ex-
periments each of these subsets consisted of topical 
key terms to be clustered.  Each such subset was 
extracted automatically from a sub-corpus corre-
sponding to a different religion (see Section 5). 
As in the standard clustering problem, our goal 
is to cluster the data such that each term cluster 
would capture a particular theme in the data.  
However, the generated clusters are expected to 
identify themes that cut across all the given subsets.  For example, one cluster consists of names of festivals of different religions, such as Easter, Christmas, Sunday (Christianity), Ramadan, Friday, Id-al-fitr (Islam) and Sukoth, Shavuot, Passover (Judaism; see Figure 4 for more examples).
3 Distributional clustering 
Our algorithmic framework elaborates on Pereira 
et al.’s (1993) distributional clustering method.  
Distributional clustering probabilistically clusters 
data elements according to the distribution of a 
given set of features associated with the data.  Each 
data element x is represented as a probability distribution p(y|x) over all features y.  In our data, p(y|x) is the empirical co-occurrence frequency of a feature word y with a key term x, normalized over all feature word co-occurrences with x.
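As an illustration, p(y|x) can be obtained from a term–feature co-occurrence count matrix by simple row normalization; the following is a minimal sketch (the toy counts are invented for illustration, not taken from the corpus):

```python
import numpy as np

def feature_distributions(counts):
    """Row-normalize a (key terms x feature words) co-occurrence
    count matrix, so that row x holds the distribution p(y|x)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Toy counts: 3 key terms against 4 feature words.
counts = [[4, 0, 1, 0],
          [0, 3, 0, 3],
          [2, 0, 2, 0]]
p_y_x = feature_distributions(counts)   # p_y_x[0] == [0.8, 0.0, 0.2, 0.0]
```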
The distributional clustering algorithmic scheme 
(Figure 1) is a probabilistic (soft) version of the 
well-known K-means algorithm.  It iteratively al-
ternates between: 
(1) Calculating assignments to clusters: calculate an assignment probability p(c|x) for each data element x into each one of the clusters c.  This soft assignment decreases exponentially with an information-theoretic distance (the KL divergence) between the element's p(y|x) representation and the centroid of c, represented by a distribution p(y|c).  The marginal cluster probability p(c) may optionally be set as a prior in this calculation, as in Tishby et al. (1999); in Figure 1 we mark it with a dotted underline, to denote that it is optional.
Set t = 1, and repeatedly iterate the two update steps below, till convergence (at time step t = 1, initialize p_t(c|x) randomly or arbitrarily and skip step 1):

 (1) p_t(c|x) = (1/z_t(x)) · p_{t−1}(c) · exp(−β · KL(p(y|x) || p_{t−1}(y|c)))

     where  z_t(x) = Σ_{c'} p_{t−1}(c') · exp(−β · KL(p(y|x) || p_{t−1}(y|c')))

 (2) p_t(y|c) = (1/p_t(c)) · Σ_x p(x) · p_t(c|x) · p(y|x)

     where  p_t(c) = Σ_x p(x) · p_t(c|x)

 (3) t ← t + 1

Figure 1: A general formulation of the iterative distributional clustering algorithm (with a fixed β value and a fixed number of clusters).  The underlined p_{t−1}(c) term is optional.
 (2) Calculating cluster centroids: calculate a probability distribution p(y|c) over all features y given each cluster c, based on the feature distribution of cluster elements, weighted by the p(c|x) assignment probability calculated in step (1) above.  This step imposes the independence of the clusters c of the features y given the data x (similarly to the naïve Bayes supervised framework).
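The two alternating steps can be sketched as follows; this is a minimal NumPy sketch, not the authors' implementation, with the optional prior and the β parameter exposed as arguments:

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """KL(p_i || q) for each row p_i of p against a single distribution q."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def distributional_clustering(p_y_x, p_x, k, beta, prior=True, iters=200, seed=0):
    """Soft clustering of distributions p(y|x); returns p(c|x) and p(y|c)."""
    rng = np.random.default_rng(seed)
    p_c_x = rng.dirichlet(np.ones(k), size=len(p_x))     # initial soft assignments
    for _ in range(iters):
        # step (2): cluster marginals p(c) and centroids p(y|c)
        p_c = p_x @ p_c_x
        p_y_c = (p_c_x * p_x[:, None]).T @ p_y_x / p_c[:, None]
        # step (1): re-assign, proportionally to [p(c)] * exp(-beta * KL)
        d = np.stack([kl_rows(p_y_x, p_y_c[c]) for c in range(k)], axis=1)
        logits = -beta * d + (np.log(p_c + 1e-12) if prior else 0.0)
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        p_c_x = np.exp(logits)
        p_c_x /= p_c_x.sum(axis=1, keepdims=True)
    return p_c_x, p_y_c
```

On two well-separated groups of feature distributions this tends to recover the grouping; in practice β would be increased over subsequent runs, as described below.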
Subsequent works (Tishby et al., 1999; Gedeon et 
al., 2003) have studied and motivated further the 
earlier distributional clustering method.  Particu-
larly, it can be shown that the algorithm of Figure 
1 locally minimizes the following cost function:

    F_dist-clust = H(C) − H(C|X) + β · H(Y|C) ,        (1)

where H denotes entropy¹ and X, Y and C are formal variables whose values range over all data elements, features and clusters, respectively.
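For concreteness, the cost of Eq. (1) can be evaluated from the model's components; the following sketch (our illustration, using natural logarithms and including the optional H(C) term) computes it from p(x), p(c|x) and p(y|c):

```python
import numpy as np

def dist_clust_cost(p_x, p_c_x, p_y_c, beta):
    """F = H(C) - H(C|X) + beta * H(Y|C), from the model's parts."""
    eps = 1e-12
    p_c = p_x @ p_c_x                                    # marginal p(c)
    h_c = -np.sum(p_c * np.log(p_c + eps))
    # H(C|X) = -sum_x p(x) sum_c p(c|x) log p(c|x)
    h_c_x = -np.sum(p_x[:, None] * p_c_x * np.log(p_c_x + eps))
    # H(Y|C) = -sum_c p(c) sum_y p(y|c) log p(y|c)
    h_y_c = -np.sum(p_c[:, None] * p_y_c * np.log(p_y_c + eps))
    return h_c - h_c_x + beta * h_y_c
```

With everything uniform over two values, H(C) and H(C|X) cancel and the cost reduces to β·H(Y|C).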
Tishby et al.'s (1999) information bottleneck method (IB) includes the marginal cluster entropy H(C) in the cost term² (it is marked with dotted underline to denote its inclusion is optional, so that Eq. (1) encapsulates two different cost terms).  The addition of H(C) corresponds to including the optional prior term p_{t−1}(c) in step (1) of the algorithm.
The parameter β that appears in the cost term and in step (1) of the algorithm can have any positive real value.  It counterbalances the relative impact of maximizing the feature information conveyed by the partition into clusters, i.e. minimizing H(Y|C), versus applying the maximum entropy principle to the cluster assignment probabilities (see Gedeon et al., 2004), i.e., maximizing H(C|X).  The higher β is, the more “determined” the algorithm becomes in assigning each element into the most appropriate cluster.  In subsequent runs of the algorithm β can be increased, yielding more separable clusters (clusters with noticeably different centroids) upon convergence.  The runs can repeat until, for some β, the desired number of separate clusters is obtained.
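The stopping criterion of this annealing schedule can be made concrete with a small helper that counts noticeably different centroids; this is an illustrative sketch only (the L1 distance and the tolerance are our choices, not the paper's):

```python
import numpy as np

def count_separable(p_y_c, tol=0.05):
    """Count centroid rows that differ pairwise by more than tol in L1 norm."""
    distinct = []
    for row in np.asarray(p_y_c, dtype=float):
        if all(np.abs(row - d).sum() > tol for d in distinct):
            distinct.append(row)
    return len(distinct)

# Two of these three centroids coincide, so only two are separable:
centroids = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
n = count_separable(centroids)   # n == 2
# A run schedule would then repeat the clustering with increasing beta
# (e.g. beta *= 1.2 per run) until the desired count is reached.
```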
4 The cross-partition clustering method  
In the cross-partition framework, the pre-partition 
of the data to subsets is captured through an addi-
                                                          
1 The entropy of a random variable A is H
a3 Aa4a6a5a8a7 a
a9b
pa3 aa4 log pa3 aa4 , where a ranges 
over all values of A; the entropy of A conditioned on another variable B is 
Ha3 Aa10 Ba4a6a5a8a7 a
a9b
pa3 aa11 ba4 log pa3 aa10ba4 , with a and b range over all values of A and B. 
2 The IB cost function was originally formulated as FIB  
a12   Ia3 Ca13 Xa4a6a14a16a15 Ia3 Ca13 Ya4 . 
This formulation is equivalent to ours, as Ia3 Ca13 Xa4a6a5 Ha3 Ca4a6a14 Ha3 Ca10Xa4  and Ia3 Ca13 Ya4a17a5  
Ha3 Ya4a18a14  Ha3 Ya10Ca4 , while Ha3 Ya4  is a constant term depending only on the data. 
tional formal variable W, whose values range over the subsets.  In our data, each religion corresponds to a different W value, w.  Each religion-related key term x is associated with one religion w, with p(w|x) = 1 (and p(w'|x) = 0 for any w' ≠ w).  Formally, our framework allows a probabilistic pre-partition, i.e., p(w|x) values between 0 and 1, but this option was not examined empirically.
The Cross-Partition (CP) clustering method 
(Figure 2) is an extended version of the probabilis-
tic K-means scheme, introducing additional steps 
in the iterative loop that incorporate the added pre-
partition variable W: 
(1) Calculating assignments to clusters, i.e. probabilistic p(c|x) values, is based on current values of cluster centroids, as in distributional clustering.
(2) Calculating subset-projected cluster centroids. 
Given the current element assignments, centroids 
are  computed separately for each  combination of 
Set t = 1 and repeatedly iterate the following update-step sequence, till convergence (in the first iteration, when t = 1, randomly or arbitrarily initialize p_t(c|x) and skip step 1):

 (1) p_t(c|x) = (1/z_t(x)) · p_{t−1}(c) · exp(−β · KL(p(y|x) || p*_{t−1}(y|c)))

     where  z_t(x) = Σ_{c'} p_{t−1}(c') · exp(−β · KL(p(y|x) || p*_{t−1}(y|c')))

 (2) p_t(y|c,w) = (1/p_t(c,w)) · Σ_x p(x) · p_t(c|x) · p(y|x) · p(w|x)

     where  p_t(c,w) = Σ_x p(x) · p_t(c|x) · p(w|x)

 (3) p*_t(c|y) = (1/z*_t(y)) · p_t(c) · Π_w p_t(y|c,w)^(α·p(w))

     where  z*_t(y) = Σ_{c'} p_t(c') · Π_w p_t(y|c',w)^(α·p(w))

 (4) p*_t(y|c) = (1/p*_t(c)) · p(y) · p*_t(c|y)

     where  p*_t(c) = Σ_y p(y) · p*_t(c|y)

 (5) t ← t + 1

Figure 2: The cross-partition clustering iterative algorithm (with fixed β and α values and a fixed number of clusters).  The terms marked by dotted underline, p_{t−1}(c) in step (1) and p_t(c) in step (3), are optional.
a cluster c projected on a pre-given subset w.  Each such subset-projected centroid is given by a probability distribution p(y|c,w) over the features y, for each c and w separately (instead of p(y|c)).
(3) Re-evaluating cluster–feature associations.  Based on the subset-projected centroids, the associations between features and clusters are re-evaluated: features that are commonly prominent across all subsets are promoted relative to features with varying prominence.  A weighted geometric mean scheme achieves this effect: the value of Π_w p(y|c,w)^p(w) is larger as the different p(y|c,w) values are distributed more uniformly over the different w's, for any given c and y.  α is a positive-valued free parameter, which controls the impact of uniformity versus variability of the averaged values.  The re-evaluated associations resulting from this stage are probability distributions over the clusters, denoted p*(c|y).  We add an asterisk to distinguish this conditional probability distribution from other p(c|y) values that can be calculated directly from the output of the previous steps.
(4) Calculating cross-partition “global” centroids: based on the probability distributions p*(c|y), we calculate a probability distribution p*(y|c) for every cluster c through a straightforward application of Bayes rule, obtaining the cross-partition cluster centroids.
The novelty of the CP algorithm lies in step (3): 
rather than deriving cluster centroids directly, as in 
the standard k-means scheme, cluster-feature asso-
ciations are biased by their prominence across the 
cluster projections over the different subsets.  This 
way, only features that are prominent in the cluster 
across most subsets end up prominent in the even-
tual cluster centroid (computed in step 4).  By in-
corporating for every c–y pair a product over all 
w's, independence of the feature-cluster associa-
tions from specific w values is ensured.  This con-
forms to our target of capturing themes that cut 
across the pre-given partition and are not corre-
lated with specific subsets. 
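Steps (3) and (4) can be sketched as follows, under our reading of the update rules (the α-weighted geometric mean over subsets followed by Bayes rule); the array shapes and toy numbers are illustrative assumptions, not the authors' code:

```python
import numpy as np

def cp_reweight(p_y_cw, p_w, p_y, p_c=None, alpha=1.0):
    """Steps (3)-(4): geometric-mean re-evaluation, then global centroids.

    p_y_cw : (k, W, m) subset-projected centroids p(y|c,w)
    p_w    : (W,) subset marginals
    p_y    : (m,) feature marginals
    p_c    : optional (k,) prior term
    """
    # step (3): score(c, y) = [p(c)] * prod_w p(y|c,w) ** (alpha * p(w))
    log_score = alpha * np.einsum('w,cwm->cm', p_w, np.log(p_y_cw + 1e-12))
    score = np.exp(log_score)
    if p_c is not None:
        score *= p_c[:, None]
    p_c_y = score / score.sum(axis=0, keepdims=True)     # p*(c|y)
    # step (4): Bayes rule yields the global centroids p*(y|c)
    p_star_c = p_c_y @ p_y                               # p*(c) = sum_y p(y) p*(c|y)
    p_y_c = p_c_y * p_y[None, :] / p_star_c[:, None]
    return p_c_y, p_y_c

# Feature y=0 is evenly prominent for cluster 1 but uneven for cluster 0,
# so the geometric mean favors cluster 1 for that feature:
p_y_cw = np.array([[[0.8, 0.2], [0.2, 0.8]],     # cluster 0: uneven across w
                   [[0.5, 0.5], [0.5, 0.5]]])    # cluster 1: even across w
p_c_y, p_y_c = cp_reweight(p_y_cw, p_w=np.array([0.5, 0.5]),
                           p_y=np.array([0.5, 0.5]))
```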
Employing a separate update step to accomplish the above goal implies a deviation from the familiar cost-based scheme.  Indeed, the
CP method is not directed by a single cost function 
that globally quantifies the cross partition cluster-
ing task on the whole.  Rather, there are four dif-
ferent “local” cost-terms, each articulating a 
different aspect of the task.  As shown in the ap-
pendix, each of the update steps (1)–(4) reduces 
one of these four cost terms, under the assumption 
that values not modified by that step are held con-
stant.  This assumption of course does not hold as 
values that are not modified by a given step are 
modified by another.  Hence, downward conver-
gence (of any of the cost terms) is not guaranteed. 
However, empirical experimentation shows that 
the dynamics of the CP algorithm tend to stabilize 
on an equilibrial steady state, where the four dif-
ferent distributions produced by the algorithm bal-
ance each other, as illustrated in Figure 3.  In fact, 
convergence occurred in all our text-based experi-
ments (as well as in experiments with synthetic 
data; Marx et al., 2004). 
Manipulating the value of the β parameter works in practice for the CP method as it does for distributional clustering: increasing β along subsequent runs enables the formation of configurations with growing numbers of clusters.  The CP framework introduces an additional parameter, α, which underlies the geometric mean scheme of step (3).  As said, the geometric mean scheme promotes those c–y associations for which the p(y|c,w) values are distributed evenly across the w's (for any fixed c and y).  A low α would imply a relatively low penalty to those c–y combinations that are not distributed evenly across the w's, but it
 
 
Figure 3: A schematic illustration of the dynam-
ics of the CP framework versus that of distribu-
tional clustering.  In distributional clustering 
convergence is onto a configuration where the 
two systems of distributions complementarily 
balance one another, bringing a cost term to a lo-
cally minimal value.  In CP, stable configurations 
maintain balanced inter-dependencies (equilib-
rium) of four systems of probability distributions. 
[Figure 3 diagram: distributional clustering converges when the two systems p(c|x) and p(y|c) balance one another; CP stabilizes when the four systems p(c|x), p(y|c,w), p*(c|y) and p*(y|c) balance each other.]
entails also a loss of more information compared to high α.  We experimented with α values that are fixed during a whole sequence of runs, while only β is gradually incremented (see Section 5).
Like the optional incorporation of priors in the distributional clustering scheme (Figure 1), the CP framework detailed in Figure 2 encapsulates four different algorithmic variants: the prior terms (marked in Figure 2 with dotted underline) can be optionally added in steps (1) and/or (3) of the algorithm.  As in the distributional clustering case, the inclusion of these terms corresponds to the inclusion of cluster entropy in the corresponding cost terms (see Appendix).  It is interesting to note that we previously introduced some of these variants separately, on intuitive grounds.  Here we term the three variations involving priors CPI (prior added in step (1) only, which is the same as the method described in Dagan et al., 2002), CPII (prior added in step (3) only) and CPIII (prior added in both steps, as the method in Marx et al., 2004).  The version with no priors is denoted CP.  Our formulation reveals that these are all special cases of the general CP framework described above.
5 Experimental Results 
The data elements that we used for our experi-
ments – religion related key terms – were auto-
matically extracted from a pre-divided corpus 
addressing five religions: Buddhism, Christianity, 
Hinduism, Islam and Judaism.  The clustered key-
term set was pre-partitioned, correspondingly, to 
five disjoint subsets, one per religion  w.3  In our 
experimental setting, the key term subsets for the 
different religions were considered disjoint, i.e., 
occurrences of the same word in different subsets 
were considered distinct elements.  The set of features y consisted of words that co-occurred with key terms within a ±5 word window, truncated by
sentence boundaries.  About      features, each 
occurring in all five sub-corpora, were selected. 
We survey below some results, produced by the plain (unpriored) CP algorithm applied to all five religions together.  First, we describe our findings qualitatively and afterwards we provide a quantitative evaluation.
                                                          
3 We use the dataset of Marx et al. (2004) – five sub-corpora, of roughly one 
million words each, consisting of introductory web pages, electronic journal 
papers and encyclopedic entries about the five religions; about     key terms 
were extracted from each sub-corpus to form the clustered subsets. 
5.1 Cross-religion Themes 
We have found that even the coarsest partition of 
the data to two clusters was informative and illu-
minating.  It revealed two major aspects that seem 
to be equally fundamental in the religion domain.  
We termed them the “spiritual aspect” and “estab-
lishment aspect” of Religion.  The “spiritual” clus-
ter incorporated terms related with theology, 
underlying concepts and personal religious experi-
ence.  Many of the terms assigned to this cluster 
with highest probability, such as heaven, hell, soul, 
god and existence, were in common use of several 
religions, but it included also religion-specific 
words such as atman, liberation and rebirth (key 
concepts of Hinduism).  The “establishment” cluster contained names of schools, sects, clerical positions and other terms connected to religious
institutions, geo-political entities and so on.  Terms 
assigned to this cluster with high probability were 
mainly religion specific: protestant, vatican, uni-
versity, council in Christianity; conservative, re-
constructionism, sephardim, ashkenazim in 
Judaism and so on (a few terms, though, were common to several religions, for instance east and west).  This two-theme partition was obtained persistently (also when the CP method was applied to pairs of religions rather than to all five).  Hence, these aspects appear to be the two universal constituents of religion-related texts in general, to the extent that the data faithfully reflect this domain.
Clusters of finer granularity still seem to capture 
fundamental, though more focused, themes.  For 
example, the partition into seven clusters revealed 
the following topics (our titles): “schools”, “divin-
ity”, “religious experience”, “writings”, “festivals 
and rite”, “material existence, sin, and suffering” 
and “family and education”.  Figure 4 details the members with highest p(c|x) values within each religion in each of the seven clusters.
The relation of the seven clusters to the coarser two-cluster configuration can be described in soft-hierarchy terms: the “schools” cluster and,
to some lesser extent “festivals” and “family”, are 
related with the “establishment aspect”  reflected in 
the partition to two, while “divinity”, “religious 
experience” and  “suffering” are  clearly associated 
with the “spiritual aspect”.  The remaining topic, 
“writings”, is equally associated with both.  The 
probabilistic framework  enabled the  CP method to 
CLUSTER 1 “Schools” 
Buddhism: america asia japan west east korea india 
china tibet 
Christianity: orthodox protestant catholic west  
orthodoxy organization rome council america  
Hinduism: west christian religious civilization  
buddhism aryan social founder shaiva  
Islam: africa asia west east sunni shiah christian 
country civilization philosophy 
Judaism: reform conservative reconstructionism zion-
ism orthodox america europe sephardim ashkenazim 
CLUSTER 2 “Divinity” 
Buddhism: god brahma 
Christianity: holy-spirit jesus-christ god father 
savior jesus baptize salvation reign 
Hinduism: god brahma 
Islam: god allah peace messenger jesus worship   
believing tawhid command 
Judaism: god hashem bless commandment abraham 
CLUSTER 3 “Religious Experience” 
Buddhism: phenomenon perception consciousness human 
concentration mindfulness physical liberation 
Christianity: moral human humanity spiritual rela-
tionship experience expression incarnation divinity 
Hinduism: consciousness atman human existence lib-
eration jnana purity sense moksha 
Islam: spiritual human physical moral consciousness 
humanity exist justice life 
Judaism: spiritual human existence physical expres-
sion humanity experience moral connect 
CLUSTER 4 “Writings” 
Buddhism: pali-canon sanskrit sutra pitaka english 
translate chapter abhidhamma book 
Christianity: chapter hebrew translate greek new-
testament book text old-testament luke 
Hinduism: rigveda gita sanskrit upanishad sutra 
smriti brahma-sutra scripture mahabharata 
Islam: chapter surah bible write translate hadith 
book language scripture 
Judaism: tanakh scripture mishnah book oral talmud 
bible write letter 
CLUSTER 5 “Festivals and Rite” 
Buddhism: full-moon celebration stupa ceremony sakya 
abbot ajahn robe retreat 
Christianity: easter tabernacle christmas sunday 
sabbath jerusalem pentecost city season 
Hinduism: puja ganesh festival ceremony durga rama 
pilgrimage rite temple 
Islam: kaabah id ramadan friday id-al-fitr haj mecah 
mosque salah 
Judaism: sukoth festival shavuot temple passover 
jerusalem rosh-hashanah temple-mount rosh-hodesh 
CLUSTER 6 “Sin, Suffering, Material Existence” 
Buddhism: lamentation water grief kill eat hell  
animal death heaven 
Christianity: fire punishment eat water animal lost 
hell perish lamb  
Hinduism: animal heaven earth death water kill demon 
birth sun 
Islam: water animal hell punishment paradise food 
pain sin earth 
Judaism: animal water eat kosher sin heaven death 
food forbid 
CLUSTER 7 “Family and Education” 
Buddhism: child friend son people family question 
learn hear teacher 
Christianity: friend family mother boy question 
woman problem learn child 
Hinduism: child question son mother family learn 
people teacher teach 
Islam: sister husband wife child family marriage 
mother woman brother 
Judaism: child marriage wife mother father women 
question family people 
 
Figure 4: A sample from a seven-cluster CP configuration of the religion data, including the first members – up to nine – with highest p(c|x) within each religion in each cluster.  Cluster titles were assigned by the authors for reference.
cope with these composite relationships between 
the coarse partition and the finer one. 
It is interesting to have a notion of those features y with high p*(c|y) within each cluster c.  We exemplify those typical features, for each one of the seven clusters, through four of the highest p*(c|y) features (excluding those terms that function as both features and clustered terms):
“schools” cluster: central, dominant, mainstream, affiliate;
“divinity” cluster: omnipotent, almighty, mercy, infinite;
“religious experience” cluster: intrinsic, mental, realm, mature;
“writings” cluster: commentary, manuscript, dictionary, grammar;
“festivals and rite” cluster: annual, funeral, rebuild, feast;
“material existence, sin, and suffering” cluster: vegetable, insect, penalty, quench;
“community and family” cluster: parent, nursing, spouse, elderly.
We focus here on the two-cluster and seven-cluster configurations, as these numbers are small enough to allow a review of all clusters.  Configurations of more clusters revealed additional sub-topics, such as education, prayer and so on.
There are some prominent points of correspondence between our findings and Ninian Smart's comparative-religion classic Dimensions of the Sacred (1996).  For instance, Smart's ritual dimension corresponds to our “festivals and rite” cluster and his experiential and emotional dimension corresponds to our “religious experience” cluster.
5.2 Evaluation with Expert Data  
We evaluated the performance of our method 
against cross-religion key term clusters constructed 
manually by a team of three experts of comparative 
religion studies.  Each manually produced cluster-
ing configuration referred to two of the five relig-
ions rather than to all five jointly, as in our 
qualitative review.  We examined eight of the ten 
religion pairs that can be chosen from the total of 
five.  Each religion pair was addressed independ-
ently by two different experts using the same set of 
key terms (so the total number of contributed configurations was 16).  Thus, we could also assess the level of agreement between experts.
As an overlap measure we employed the Jaccard coefficient, which is the ratio n11 / (n11 + n10 + n01), where:
n11 is the number of term pairs assigned to the same cluster by both our method and the expert;
n10 is the number of term pairs co-assigned by our method but not by the expert;
n01 is the number of term pairs co-assigned by the expert but not by our method.
As the Jaccard score relies on counts of individual term pairs, no assumption with regard to the suitable number of clusters is required.  Hence, for each religion pair we produced with our method configurations of two to 16 clusters and calculated for each one a Jaccard score based on the overlap with the relevant expert configurations.  The scores obtained were averaged over the 15 configurations.  The means, over all 16 experimental cases, of those average values are displayed in Table 1.
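The pairwise Jaccard computation follows directly from the definition above; a minimal sketch (the integer labels are hypothetical cluster indices over the same term list):

```python
from itertools import combinations

def jaccard_clusterings(labels_a, labels_b):
    """Jaccard overlap n11 / (n11 + n10 + n01) over all term pairs,
    where 'co-assigned' means the pair shares a cluster label."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    return n11 / (n11 + n10 + n01)

# Identical clusterings score 1.0; disagreeing on one term lowers the score.
score = jaccard_clusterings([0, 0, 1, 1], [0, 0, 1, 2])   # -> 0.5
```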
We tested all four CP method variants, with different fixed values of the α parameter.  In addition, we evaluated results obtained by the priored version of distributional clustering (the IB method, Tishby et al., 1999; see Figure 1).  Marx et al. (2004) mentioned Information Bottleneck with Side Information (IB-SI, Chechik & Tishby, 2003) as a method capable – unlike standard distributional clustering – of capturing information regarding pre-partition to subsets, which makes this method a seemingly sensible alternative to the CP method.  Therefore, we tested the IB-SI method as well, following the adaptation scheme to the CP setting described by Marx et al., with a fixed value of its parameter, γ = 0.07 (with higher values convergence did not take place in all experiments).  As Table 1 shows, the different CP variants performed better than the alternatives.  The CPIII variant, with both prior types, was less robust to changes in the α value and seemed to be more sensitive to noise.
The experimental part of this work demonstrates that the task of drawing thematic correspondences is challenging.  In the particular domain that we have examined, the level of agreement between experts seems to make it evident that the task is inherently subjective and only partly consensual.  It
Table 1: Mean Jaccard scores for several methods, examined over the 16 religion-pair evaluation cases (incorporating mean Jaccard scores over 2–16 cluster configurations, see text).  The differences between most CP variants and cross-expert agreement are not statistically significant.  The differences between IB, IB-SI and CPIII with α = 0.83 and expert agreement are significant (two-tailed t-test, df = 15).

        α = 0.48   α = 0.56   α = 0.67   α = 0.83
CP      0.405      0.383      0.400      0.394
CPI     0.416      0.400      0.415      0.399
CPII    0.410      0.387      0.409      0.417
CPIII   0.405      0.420      0.370      0.293

IB: 0.1734        IB-SI (γ = 0.07): 0.1995
Agreement between the experts: 0.462
is therefore remarkable that most variations of our method approximate rather closely the upper bound set by the level of agreement between the experts.  Further, we have shown the merit of promoting shared cross-subset patterns and
neutralizing topic-specific regularities in a newly 
introduced dedicated computational step.  Methods 
that do not consider this direction (IB) or that in-
corporate it within a more conventional cost based 
search (IB-SI) yield notably poorer performance. 
6 Discussion 
In this paper, we studied and demonstrated the 
cross partition method, a computational framework 
that addresses the task of identifying analogies and 
correspondences in texts.  Our approach to this 
problem bridges between cognitive observations 
regarding analogy making, which have inspired it, 
and unsupervised learning techniques. 
While previous cognitively-motivated computa-
tional frameworks required structured input (e.g. 
Falkenhainer et al., 1989), the CP method adapts 
distributional clustering (Pereira et al., 1993), a 
standard approach applicable to unstructured data.  
Unlike standard clustering, the CP method consid-
ers an additional source of information: a pre-
partition of the clustered data into several topical 
subsets (originating in different sub-corpora), be-
tween which a correspondence is drawn.   
The innovative aspect of the cross-partition 
method lies in distinguishing feature information 
that cuts across the given pre-partition into subsets 
from subset-specific information.  In order to in-
corporate this aspect within distributional cluster-
ing, the CP method interleaves several update 
steps, each locally optimizing a different cost term. 
Our experiments demonstrate that the CP 
method is capable of revealing interesting and non-
trivial corresponding themes in texts.  The results 
obtained with most variants of the CP method, 
with suitable tuning of the parameters, outperform 
comparable methods – standard distributional clus-
tering and the IB-SI method – and are rather close 
to the level of agreement between experts. 
The CP method revealed, at various resolution 
levels, meaningful themes that to our under-
standing can be considered prominent constituents 
of religion.  It would be an interesting challenge to 
apply the CP framework to other tasks, possibly of 
a more practical flavor, such as comparing and 
detecting commonalities between commercial 
products and firms, identifying equivalences and 
precedents in legal cases, and so on. 
References  
Gal Chechik and Naftali Tishby. 2003.  Extracting rele-
vant structures with side information.   In S. Becker, 
S. Thrun, and K. Obermayer (eds.), Advances in Neu-
ral Information Processing Systems 15 (NIPS 2002), 
pp. 857-864. 
Ido Dagan, Zvika Marx and Eli Shamir. 2002. Cross-
dataset clustering: Revealing corresponding themes 
across multiple corpora.  In D. Roth and A. van den 
Bosch (eds.), Proceedings of the 6th Conference on 
Natural Language Learning (CoNLL-2002), pp. 15-
21. 
Brian Falkenhainer, Kenneth D. Forbus and Dedre 
Gentner. 1989.  The structure mapping engine: Algo-
rithm and example. Artificial Intelligence, 41(1):1-
63. 
Tomas Gedeon, Albert E. Parker, and Alexander G. 
Dimitrov, 2003. Information distortion and neural 
coding. Canadian Applied Mathematics Quarterly 
10(1):33-70. 
Hang Li.  2002.  Word Clustering and Disambiguation 
based on co-occurrence data, Natural Language En-
gineering, 8(1):25-42. 
Dekang Lin and Patrick Pantel.  2002. Concept discov-
ery from text. In Proceedings of the Conference on 
Computational Linguistics (COLING-02), pp. 577-
583. 
Zvika Marx, Ido Dagan and Eli Shamir. 2004.  Identify-
ing structure across pre-partitioned data.  In S. Thrun, 
L. Saul, and B. Schölkopf (eds.), Advances in Neural 
Information Processing Systems 16 (NIPS 2003), pp. 
489-496. 
Fernando C. Pereira, Naftali Tishby and Lillian Lee.  
1993.  Distributional clustering of English words. In 
Proceedings of the 31st Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL '93), pp. 
183-190.   
Ninian Smart. 1996. Dimensions of the Sacred: An 
Anatomy of the World's Beliefs.  University of Cali-
fornia Press, Berkeley and Los Angeles, CA. 
Naftali Tishby, Fernando C. Pereira and William Bialek. 
1999.  The information bottleneck method.  In 37th 
Annual Allerton Conference on Communication, 
Control, and Computing, pp. 368-379. 
Appendix 
This appendix specifies the four "local" cost terms men-
tioned in Section 4 and describes how the CP algo-
rithmic framework (Fig. 2) modifies them. 
Step (1) of the CP framework computes p(c|x) values 
that reduce the value of the following term: 

    FCP1 = [H(C)] − H(C|X) − β·⟨log p*(y|c)⟩ , 

where ⟨log p*(y|c)⟩ = Σx p(x) Σc p(c|x) Σy p(y|x) log p*(y|c). 
The p*(y|c) values are considered as if they are constant. 
Step (2) computes p(c|x) values reducing the value of 

    FCP2 = − Σx p(x) Σc p(c|x) Σy p(y|x) Σw p(w|x) log p(y|c,w) , 

which is equal to H(Y|C,W), subject to an independence 
assumption extending the assumption explicated in 
footnote 4, namely, for each feature y, cluster c, and pre-
given subset w: p(c,y,w) = Σx p(x) p(c|x) p(y|x) p(w|x). 
Step (3) finds p*(c|y) values that reduce the value of 

    FCP3 = [H*(C)] − H*(C|Y) − β·⟨log p(y|c,w)⟩ , 

where H*(C|Y) = − Σy p(y) Σc p*(c|y) log p*(c|y) and 

    ⟨log p(y|c,w)⟩ = Σw p(w) Σy p(y) Σc p*(c|y) log p(y|c,w) , 

the negation of which is equal to the conditional entropy 
H(Y|C,W) under an assumption that W is independent 
of C and Y.  The p(y|c,w) values in this term are 
considered as if they are held constant. 
Step (4) finds p*(y|c) values that reduce the value of 

    FCP4 = − Σy p(y) Σc p*(c|y) log p*(y|c) , 

which can be denoted H*(Y|C). The p*(c|y) values 
are considered as if they are constant. 
The bracketed H(C) and H*(C) terms in FCP1 and FCP3 
are optional; their inclusion implies the inclusion of the 
prior terms in steps (1) and (3) of the algorithm (see 
Figure 2). 
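As an illustration of how step (1) can be realized, the numerical sketch below applies the soft-assignment update implied by FCP1, reassigning each p(c|x) proportionally to p(c)·exp(β Σy p(y|x) log p*(y|c)).  The array shapes, the random toy distributions, and the specific update form are illustrative assumptions rather than our exact implementation; steps (2)-(4) would update their respective distributions analogously:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, n_c = 6, 4, 2           # elements, features, clusters (toy sizes)
beta = 0.56                       # trade-off parameter, as in Table 1

p_x = np.full(n_x, 1.0 / n_x)                  # p(x), uniform over elements
p_y_x = rng.dirichlet(np.ones(n_y), size=n_x)  # p(y|x), rows sum to 1
p_c_x = rng.dirichlet(np.ones(n_c), size=n_x)  # p(c|x), initial soft assignment
p_y_c = rng.dirichlet(np.ones(n_y), size=n_c)  # p*(y|c), held constant in step (1)

def step1(p_c_x, use_prior=True):
    """One step-(1) sweep: p(c|x) proportional to
    p(c) * exp(beta * sum_y p(y|x) log p*(y|c))."""
    p_c = p_x @ p_c_x                          # marginal p(c)
    loglik = p_y_x @ np.log(p_y_c).T           # sum_y p(y|x) log p*(y|c)
    new = np.exp(beta * loglik)
    if use_prior:                              # the optional [H(C)] prior term
        new *= p_c
    return new / new.sum(axis=1, keepdims=True)

p_c_x = step1(p_c_x)
assert np.allclose(p_c_x.sum(axis=1), 1.0)     # rows remain distributions
```

With the prior disabled (use_prior=False), the update reduces to a pure maximum-entropy soft assignment, mirroring the omission of the optional [H(C)] term.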