<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2009">
  <Title>Cross-dataset Clustering: Revealing Corresponding Themes Across Multiple Corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper addresses the problem of detecting corresponding subtopics, or themes, within related bodies of text. Such a task is typical of comparative research, whether commercial or scientific: a conceivable application would aim at detecting corresponding characteristics regarding, e.g., companies, markets, legal systems or political organizations.</Paragraph>
    <Paragraph position="1"> Clustering has often been perceived as a means of extracting meaningful components from data (Tishby, Pereira and Bialek, 1999). Regarding textual data, clusters of words (Pereira, Tishby and Lee, 1993) or documents (Lee and Seung, 1999; Dhillon and Modha, 2001) are often interpreted as capturing topics or themes that play a prominent role in the analyzed texts.</Paragraph>
    <Paragraph position="2"> Our work extends the standard clustering paradigm, which pertains to a single dataset.</Paragraph>
    <Paragraph position="3"> We address a setting in which several datasets, corresponding to related domains, are given.</Paragraph>
    <Paragraph position="4"> We focus on the comparative task of detecting those themes that are expressed across several datasets, rather than discovering internal themes within each individual dataset.</Paragraph>
    <Paragraph position="5"> More specifically, we address the task of simultaneously clustering multiple datasets such that each cluster includes elements from several datasets, capturing a common theme that is shared across the sets. We term this task cross-dataset (CD) clustering.</Paragraph>
    <Paragraph position="6"> In this article we demonstrate CD clustering by detecting corresponding themes across three different religions. That is, we apply our approach to three sets of religion-related keywords, extracted from three corpora, which include encyclopedic entries and introductory articles regarding Buddhism, Christianity and Islam. Each of the three representative keyword sets presumably encapsulates topics and themes discussed within its source corpus.</Paragraph>
    <Paragraph position="7"> Our algorithm succeeds in revealing common themes such as scriptures, rituals and schools through respective keyword clusters consisting of terms such as Sutra, Bible and Quran; Full Moon, Easter and Id al Fitr; Theravada, Protestant and Shiite (see Table 1 below for a detailed depiction of our results).</Paragraph>
    <Paragraph position="8"> The CD clustering algorithm, presented in Section 2.2 below, extends the information bottleneck (IB) soft clustering method. Our modifications to the IB formulation enable the clustering algorithm to capture characteristic patterns that run across different datasets, rather than being trapped by unique characteristics of individual datasets.</Paragraph>
    <Paragraph position="9"> Like other topic discovery tasks that are approached by clustering, the goal of CD clustering is not defined in precise terms. Yet it is clear that its focus on detecting themes in a comparative manner, within multiple datasets, distinguishes CD clustering substantially from the standard single-dataset clustering paradigm. A related problem, that of detecting analogies between different information systems, has been addressed in the past within cognitive research (e.g. Gentner, 1983; Hofstadter et al., 1995).</Paragraph>
    <Paragraph position="10"> Recently, a related computational method for detecting corresponding themes has been introduced (coupled clustering, Marx et al., 2002). The coupled clustering setting, however, being focused on detecting analogies, is limited to two datasets. Further, it requires similarity values between pairs of data elements as input; this setting does not seem straightforwardly applicable to the multiple-dataset setting. Our method, in contrast, uses a more direct source of information, namely word co-occurrence statistics within the analyzed corpora. Another difference is that we take the soft approach to clustering, producing probabilities of assignment to clusters rather than deterministic 0/1 assignment values.</Paragraph>
  </Section>
</Paper>