<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0907"> <Title>Detecting Sub-Topic Correspondence through Bipartite Term Clustering</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Introduction: Corresponding Entities in Text Fragments </SectionTitle> <Paragraph position="0"> Information technology is continuously challenged by the growing demand for accurate performance in fields such as data retrieval, document classification and knowledge representation. A typical task in these areas requires well-developed capabilities for assessing similarity between text fragments (see, for example, Chapter 6 in Kowalski, 1997).</Paragraph> <Paragraph position="1"> Nevertheless, it is apparent that standard methods for detecting document similarity do not cover the whole range of what people perceive as similar.</Paragraph> <Paragraph position="2"> Common treatment of document similarity typically aims at a unified pairwise measure expressing the extent to which documents are similar to each other. Consequently, an Internet surfer hitting the &quot;what's related&quot; button in her browser gets a list of pages that are supposed to be similar to the currently viewed one. Here, we address questions situated one step ahead: given documents that are already known to be similar, how are they similar? How can we refer a user to relevant aspects in a large collection of similar documents? One possible strategy, rooted in cognitive considerations, is to present for each pair of similar documents a detailed &quot;map&quot; connecting corresponding concepts, entities or sub-topics. The present article provides initial directions towards the identification of such correspondences.</Paragraph> <Paragraph position="3"> Consider, for example, the following two fragments taken from a pair of 1986 Reuters news articles: 1. LOS ANGELES, March 13 - Computer Memories Inc .... agreed to acquire Hemdale Film Corp .... 
That company's owner, John Daly, would then become chief executive officer of the combined company...</Paragraph> </Section> <Section position="3" start_page="0" end_page="47" type="metho"> <SectionTitle> 2. NEW YORK, March 25 - Messidor Ltd </SectionTitle> <Paragraph position="0"> said it signed a letter of intent to acquire 100 pct of the outstanding shares of Triton Beleggingen Nederland B.V .... If approved, the president of Triton, Hendrik Bokma, will be nominated as chairman of the combined company ....</Paragraph> <Paragraph position="1"> Both fragments deal with the intention of a certain company to acquire another company. Since the word 'acquire' appears in both articles, keyword-based methods would interpret it as positive evidence for evaluating the text fragments as similar to each other. More sophisticated methods (e.g. Latent Semantic Indexing; Deerwester et al., 1990) incorporate vector-based statistical term-similarity models that may take into account the correspondence of different terms that resemble each other in meaning. For example, the corresponding term pairs 'owner' - 'president' and 'chief executive officer' - 'chairman' may contribute to the unified value of evaluated similarity. Now, consider another pair of terms: 'become' - 'nominated'. These terms probably share only a moderate degree of similarity in general, but a human reader will find their correspondence much more meaningful in this particular context. Identification of this context-dependent equivalence enables a reader to perceive that John Daly and Hendrik Bokma, respectively mentioned in the above texts, play the analogous part of being appointed to a managerial position. Existing similarity evaluation methods do not consider such analogies and do not provide tools for pointing them out.</Paragraph> <Paragraph position="2"> Unlike common methods in automated natural language processing, cognitive research has emphasized the role of analogy in human thinking. 
The ability to detect analogous similarities between complex objects is presented by cognitive theories as a foremost mechanism in reasoning, problem solving and human intelligence in general. The structure mapping theory (Gentner, 1983) presents analogy as a mapping between two distinct systems. Particular entities that compose each system are not similar in general, but rather the relations among them resemble each other.</Paragraph> <Paragraph position="3"> Hence, entities in one system are perceived as playing a role similar to that played by corresponding entities in the other system.</Paragraph> <Paragraph position="4"> Another approach (Hofstadter et al., 1995) emphasizes the context-dependent interplay between perceiving features of the systems under comparison and creating representations that are suitable for mutual mapping.</Paragraph> <Paragraph position="5"> Motivated by the above considerations, we present an initial step towards automatically identifying corresponding entities in natural language texts. At this stage of our research, correspondences are based on term similarity only, so terms describing similar topics are coupled. Identification and mapping of both entities and relations, using additional information such as syntactic constructs (a direction which has been proposed in Hasse, 1995), will be handled in subsequent stages.</Paragraph> <Paragraph position="6"> However, presenting context-dependent topic correspondences in a pair of texts is by itself a non-trivial elaboration of standard approaches to document similarity.</Paragraph> <Paragraph position="7"> Unsupervised specification of a precise structure, let alone the optimal structure, is known to be an ill-posed problem even in classical tasks such as straightforward clustering. Nevertheless, we observe that our task here is to find relevant structure in the data. In Section 2, we present a model for the structure we aim at. 
Then, in Section 3, we resort to the standard mechanism of capturing the quality of the proposed structure by a suitable cost function, followed by an algorithm seeking to minimize the cost. As in more studied learning tasks, alternative costs or optimization methods are possible and form a legitimate subject for future research. At this stage, we concentrate on demonstrating the feasibility of obtaining sub-topic similarity maps between text fragments through a novel bipartite clustering setting.</Paragraph> <Paragraph position="8"> 2. The Model: Term Subset Coupling by</Paragraph> <Section position="1" start_page="45" end_page="47" type="sub_section"> <SectionTitle> Bipartite Clustering </SectionTitle> <Paragraph position="0"> The present study suggests a framework for identifying corresponding sub-topics within a pair of text fragments. Our model represents sub-topics as groups of related terms, assuming that such groups correspond to actual sub-topics. To this end, the sets of terms appearing in each of the fragments are divided into coupled subsets.</Paragraph> <Paragraph position="1"> A pair of coupled subsets, one from each fragment, is supposed to represent corresponding sub-topics of the compared fragments.</Paragraph> <Paragraph position="2"> As an illustration, consider the following small term sets: (1) {attendant, minister, government} (2) {employee, manager} (3) {student, university} Term-subset coupling, based on semantic term similarity, applied to the first two term sets, might produce the following subset couples: {attendant} -- {employee} {minister, government} -- {manager} For similar considerations applied to sets (1) and (3), the result might look like: {attendant, minister} -- {student} {government} -- {university} These illustrative examples demonstrate expected topical partitions of the term sets according to the diagnosticity principle (Tversky, 1977). 
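The subset coupling illustrated by these examples can be sketched in code. The similarity values and the simple threshold-based greedy merging rule below are our own illustrative assumptions, not the paper's actual input data or algorithm:

```python
# Illustrative sketch of term-subset coupling on the toy sets above.
# Similarity values and the 0.5 merge threshold are invented for the
# example; the paper's real input uses corpus-derived similarities.
set1 = ["attendant", "minister", "government"]
set2 = ["employee", "manager"]

sim = {  # s(x, y) for x in set1, y in set2 (hypothetical numbers)
    ("attendant", "employee"): 0.8,
    ("attendant", "manager"): 0.2,
    ("minister", "employee"): 0.1,
    ("minister", "manager"): 0.7,
    ("government", "employee"): 0.1,
    ("government", "manager"): 0.6,
}

# Greedily merge the clusters of the most similar cross-set pairs,
# stopping below the threshold.  Each term starts as a singleton.
clusters = {t: frozenset([t]) for t in set1 + set2}
for (x, y), s in sorted(sim.items(), key=lambda kv: -kv[1]):
    if s > 0.5 and clusters[x] != clusters[y]:
        merged = clusters[x] | clusters[y]
        for t in merged:
            clusters[t] = merged

# Each multi-element cluster couples a subset of set1 with a subset
# of set2, as in the paper's example.
couples = {tuple(sorted(c)) for c in clusters.values()}
```

With these invented values, the resulting partition reproduces the first coupling above: {attendant} -- {employee} and {minister, government} -- {manager}.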
How each set is divided depends on how the terms of both sets resemble each other: in the first case, the grouped topics are &quot;workers&quot; and &quot;management&quot;; in the second case, &quot;individuals&quot; and &quot;institutions&quot;. To obtain subset coupling, we apply clustering methods. Quite a few previous works have investigated the idea of identifying semantic notions with term clusters. Term clustering methods are typically based on the statistics of term co-occurrence within a word window, or within syntactic constructs (e.g. Pereira et al., 1993). The notion of pairwise clustering refers to clustering established, as in the present study, on a previous assessment of term similarity values, a process often based itself on term co-occurrence data (e.g. Lin, 1999).</Paragraph> <Paragraph position="3"> A standard pairwise clustering problem can be represented by a weighted graph, where each node stands for a data point and each edge is weighted according to the degree of similarity of the nodes it connects. A (hard) clustering procedure produces a partition of the graph nodes into disjoint connected components forming a cluster configuration.</Paragraph> <Paragraph position="4"> Our setting is special in that it considers only similarity values referring to term pairs from two distinct text fragments, such as 'attendant' - 'manager' in the example above, but not 'attendant' - 'minister'. The exclusion of within-fragment similarities conforms to our context-oriented approach, but there is no essential restriction on incorporating them in a more comprehensive model. Consequently, our setting is represented by a bipartite graph containing only edges connecting nodes from two distinct node sets, each of which is associated with terms from a different text fragment. 
A term that appears in both articles is represented independently in both sets.</Paragraph> <Paragraph position="5"> The use of clustering within a bipartite graph (bipartite clustering) is not common in natural language processing. Hofmann and Puzicha (1998) introduce a taxonomy of likelihood-based clustering algorithms for co-occurrence data, some of which produce bipartite clustering. To illustrate their soft clustering method, they present sets of nouns and adjectives that tend to co-occur in a large corpus. Sheffer-Hazan (1997) developed a bipartite clustering algorithm based on description length considerations for purposes of knowledge summarization and text mining. Both works exploit co-occurrence data to expose global characteristics of a corpus. The present study, through its use of pre-compiled similarity data, also draws on co-occurrence statistics in a corpus. Here, we go beyond that to obtain fine-grained context-dependent groupings in the term sets of particular text-fragment pairs.</Paragraph> <Paragraph position="6"> When pairwise clustering algorithms are applied to a bipartite graph, the assignment of a term from one of the sets into a cluster is influenced by the assignments of similar terms from the other set. Each of the resulting clusters, if it contains more than a single element, necessarily contains terms from both parts, e.g. {minister, government, manager} in the example above.</Paragraph> <Paragraph position="7"> Therefore, a cluster couples two term subsets, each from a different fragment: the subset {manager} is coupled to the subset {minister, government}. Clusters containing a single element represent terms that could not be assigned to any of the coupled subsets by the clustering method.</Paragraph> <Paragraph position="8"> 3. Algorithms: Balancing Within-Cluster and Between-Cluster Similarities Let X and Y denote the sets of the terms appearing in a given pair of articles. 
We currently use the &quot;bag of words&quot; model, where term repetitions are not counted. Non-negative similarity values, s(x,y), are given as input for each x∈X and y∈Y. Assume that some clustering procedure is applied to the appropriate bipartite graph, so that a partition of the graph nodes is given. Denote by Cx the part containing x∈X. Recall that if Cx contains additional elements, some of them must be elements of Y. Hence, Cx represents a coupling of the subsets X∩Cx and Y∩Cx.</Paragraph> <Paragraph position="9"> A basic clustering strategy is the greedy single-link agglomerative method. It starts with a configuration in which for each x∈X and y∈Y, Cx = {x}, Cy = {y}. Then, the method repeatedly merges a pair of clusters Cx and Cy such that x and y are the most similar elements for which Cx ≠ Cy. The result is a hierarchical arrangement of clusters, also called a dendrogram. There is no fixed recipe for selecting the best clustering configuration (partitioning) in the hierarchy. Furthermore, in our case the number of target sub-topics is not known in advance.</Paragraph> <Paragraph position="10"> We thus refer to the obtained hierarchy as representing a range of possible cluster configurations, corresponding to varying granularity levels.</Paragraph> <Paragraph position="11"> An alternative approach states in advance what is expected from a good clustering configuration, rather than letting the merging process dictate the clustering as in the case of single-link. This is customarily done by formulating a cost function, to be minimized by an optimal configuration. In our case, as in clustering in general, a cost function reflects the interplay between two dual constraints: (i) Maximizing within-cluster similarity, i.e. the tendency to include similar objects in the same cluster. 
It should be stressed that in the bipartite setting the notion of 'within-cluster' refers to similarity values between pairs of terms from coupled subsets, while the actual similarities within each subset are not considered. Excessive satisfaction of this constraint dictates a cluster configuration containing many small clusters, each characterized by high similarity values among its members.</Paragraph> <Paragraph position="12"> (ii) Minimizing between-cluster similarity, i.e. the tendency to avoid assigning similar objects (in the bipartite setting, from distinct fragments) into different clusters. Excessive satisfaction of this constraint results in large clusters, so that only minimal between-cluster similarity is present. We have considered several cost function schemes, reflecting different types of interactions between the above two constraints. One particular scheme, which enables obtaining context-dependent subset coupling at various granularity levels, is presented here.</Paragraph> <Paragraph position="13"> This scheme captures the between-cluster similarity minimization constraint by including, for each term x∈X (and correspondingly for each y∈Y), a cost component proportional to the between-cluster similarity values associated with that term, i.e. proportional to Σ_{y∈Y−Cx} s(x,y). According to the other constraint of within-cluster similarity maximization, each term x is supposed to be assigned into a cluster such that its contribution to the total measure of within-cluster similarity is maximal. To obtain a cost measure which is inversely proportional to the contribution of x to total within-cluster similarity, we measure the total degree of within-cluster similarity obtained if x were removed from its cluster Cx. 
That is, we add for each x∈X (and correspondingly for each y∈Y) a cost component proportional to the total contribution to within-cluster similarity of the other subset members: Σ_{y∈Y∩Cx} Σ_{x'∈X∩Cx, x'≠x} s(x',y).</Paragraph> <Paragraph position="14"> This component is further multiplied by 1/|X| to normalize it relative to the entire set size. Finally, the cost function scheme introduces a parameter, 0 < α < 1, which controls the relative impact of the two constraints. The resulting scheme is thus a weighted sum of the two cost components over all terms in X and Y: E(M) = Σ_{x∈X} [ (1−α) Σ_{y∈Y−Cx} s(x,y) + (α/|X|) Σ_{y∈Y∩Cx} Σ_{x'∈X∩Cx, x'≠x} s(x',y) ] + Σ_{y∈Y} [ (1−α) Σ_{x∈X−Cy} s(x,y) + (α/|Y|) Σ_{x∈X∩Cy} Σ_{y'∈Y∩Cy, y'≠y} s(x,y') ] Varying α has the effect of changing cluster size within the optimal configuration, due to the varying impact of the two constraints (increasing α reduces cluster size, and vice versa). Another interesting property of this scheme is that coupling two singletons which have a positive similarity value always reduces the total cost. This is because such a coupling, forming a two-member cluster, reduces the between-cluster similarity cost and does not increase the within-cluster similarity cost.</Paragraph> <Paragraph position="15"> Note that E(M) is intended to reflect the balance of constraints, as described above, only for a particular pair of documents at a time. Its potential value as a basis for a unified document similarity measure, sensitive to context-dependent and analogous similarities, is yet to be investigated.</Paragraph> <Paragraph position="16"> There are sophisticated techniques to compute an optimal solution minimizing the cost function for a given α value, e.g. simulated annealing and deterministic annealing (Hofmann and Buhmann, 1997). A simple strategy, assumed to suffice for a preliminary demonstration of cost function behavior for any α, is a greedy method, similar to the single-link method. It starts with a configuration in which for each x and y, Cx = {x}, Cy = {y}, and then repeatedly merges the two clusters whose merge yields the greatest cost reduction. 
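A minimal sketch of the cost scheme and greedy merging described above. The exact weighting (α on the within-cluster component, 1−α on the between-cluster component) is our reading of the text, and the data and all function names are ours:

```python
from itertools import combinations

def make_sim(pairs):
    # Symmetric lookup for the given cross-fragment similarities;
    # unspecified pairs default to 0.
    def s(a, b):
        return pairs.get((a, b), pairs.get((b, a), 0.0))
    return s

def cost(partition, X, Y, s, alpha):
    # E(M): for each term, (1 - alpha) times its between-cluster
    # similarity, plus alpha/|set| times the within-cluster similarity
    # contributed by the other members of its own subset.
    cluster_of = {t: c for c in partition for t in c}
    total = 0.0
    for A, B in ((X, Y), (Y, X)):
        for t in A:
            C = cluster_of[t]
            between = sum(s(t, u) for u in B if u not in C)
            within_rest = sum(s(t2, u) for t2 in A
                              if t2 in C and t2 != t
                              for u in B if u in C)
            total += (1 - alpha) * between + alpha * within_rest / len(A)
    return total

def greedy_cluster(X, Y, s, alpha):
    # Start from singletons; repeatedly apply the merge that most
    # reduces E(M); stop when no merge reduces the cost.
    partition = [frozenset([t]) for t in X + Y]
    while True:
        base = cost(partition, X, Y, s, alpha)
        best = None
        for a, b in combinations(partition, 2):
            trial = [c for c in partition if c is not a and c is not b]
            trial.append(a | b)
            e = cost(trial, X, Y, s, alpha)
            if e < base and (best is None or e < best[0]):
                best = (e, trial)
        if best is None:
            return partition
        partition = best[1]

# Hypothetical data: two tiny "fragments" with obvious cross couples.
X, Y = ["a", "b"], ["c", "d"]
s = make_sim({("a", "c"): 1.0, ("b", "d"): 0.9,
              ("a", "d"): 0.1, ("b", "c"): 0.1})
coupled = {tuple(sorted(c)) for c in greedy_cluster(X, Y, s, 0.5)}
```

In this toy run, coupling each pair of singletons with high similarity reduces the cost, consistent with the singleton-coupling property noted above; the process then halts, since merging the two resulting couples would add within-cluster cost without removing enough between-cluster cost.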
Unlike single-link, this process stops when no further cost reduction is possible.</Paragraph> </Section> </Section> <Section position="4" start_page="47" end_page="47" type="metho"> <SectionTitle> 4. Results: Hierarchy and Granularity </SectionTitle> <Paragraph position="0"> Our experiments were performed for term coupling between pairs of Reuters news articles. Here we qualitatively demonstrate the results using the same pair of articles as in the example in Section 1 (devising a quantitative evaluation method for our task is an issue for future research). We used pairwise term similarity values that were compiled by Dekang Lin, using a similarity measure based on information-theoretic considerations, from co-occurrence data in a large corpus of news articles (Lin, 1999; data available for download from http://www.cs.umanitoba.ca/~lindek/sims.tgz). The term sets were taken to be the sets of words, in each article, which had at least one positive similarity value with a term in the other article. The vocabulary included verbs, nouns, adjectives and adverbs, excluding a small set of stop words (e.g. 'about'). The Conexor NP&Name Parser (Voutilainen, 1997) was used to obtain word lemmas.</Paragraph> <Paragraph position="1"> Figure 1 displays the detailed term subset coupling generated by the single-link procedure. The hierarchy is indicated by different widths of contours bounding term subsets. Each contour width presents the clusters obtained after all merges that were imposed by similarity values larger than a threshold t. Coupling connections are displayed for the most detailed granularity level. An apparent drawback of this method is that many terms are assigned into clusters only at a late stage, although they seem to be related to one or more of the smaller clusters.</Paragraph> <Paragraph position="2"> E.g., 'management' seems to be related, and indeed has non-zero similarity values, to 'chairman', 'director' and 'president', as well as to 'chief'. 
This is indicated by including such terms in the largest thin frames in Figure 1, but not in any bold smaller frame.</Paragraph> <Paragraph position="3"> We have also implemented more sophisticated methods proposed recently (Blatt et al., 1997; Gdalyahu et al., 1999) that are related to the single-link strategy. These methods are designed to overcome cases where a few &quot;noisy&quot; data points invoke a union of clusters that would have remained separate in the absence of these points. Both methods repeatedly sample stochastic approximated cluster configurations.</Paragraph> <Paragraph position="4"> Elements persistently found in the same cluster across the sample are assigned to the same cluster also in the final solution. The results obtained with these methods are qualitatively similar to those obtained with single-link. This suggests that the fact that certain terms remain uncoupled at high granularity levels cannot be attributed to random inaccuracies in the data.</Paragraph> <Paragraph position="5"> Figure 2 displays a detailed term subset coupling generated by the cost-guided greedy strategy.</Paragraph> <Paragraph position="6"> The lack of a strict hierarchy prevents displaying a wide range of granularity levels within the figure, so a sample of clusters is presented. The gray clusters demonstrate the impact of lower α values on cluster granularity. Several of the coupled term-subsets represent actual sub-topics, such as &quot;trade operations&quot; and &quot;managerial positions&quot;. Compared with the single-link algorithm, the cost-based algorithm does succeed in coupling related terms such as 'management' and 'chairman' within a relatively tight cluster. Note also that the algorithm couples the words 'become' and 'nominate', as discussed in Section 1.</Paragraph> </Section> </Paper>