<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3812"> <Title>Chinese Whispers an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems</Title> <Section position="4" start_page="0" end_page="75" type="metho"> <SectionTitle> 2 Chinese Whispers Algorithm </SectionTitle> <Paragraph position="0"> In this section, the Chinese Whispers (CW) algorithm is outlined. After recalling important concepts from Graph Theory (cf. Bollobas 1998), we describe two views on the algorithm. The second view is used to relate CW to another graph clustering algorithm, namely MCL (van Dongen, 2000).</Paragraph> <Paragraph position="1"> We use the following notation throughout this paper: Let G=(V,E) be a weighted graph with nodes (vi)[?]V and weighted edges (vi, vj, wij) [?]E with weight wij. If (vi, vj, wij)[?]E implies (vj, vi, wij)[?]E, then the graph is undirected. If all weights are 1, G is called unweighted.</Paragraph> <Paragraph position="2"> The degree of a node is the number of edges a node takes part in. The neighborhood of a node v is defined by the set of all nodes v' such that (v,v',w)[?]E or (v',v,w)[?]E; it consists of all nodes that are connected to v.</Paragraph> <Paragraph position="3"> The adjacency matrix AG of a graph G with n nodes is an nxn matrix where the entry aij denotes the weight of the edge between vi and vj , 0 otherwise.</Paragraph> <Paragraph position="4"> The class matrix DG of a Graph G with n nodes is an nxn matrix where rows represent nodes and columns represent classes (ci)[?]C. The value dij at row i and column j represents the amount of vi as belonging to a class cj. For convention, class matrices are row-normalized; the i-th row denotes a distribution of vi over C. If all rows have exactly one non-zero entry with value 1, DG denotes a hard partitioning of V, soft partitioning otherwise.</Paragraph> <Section position="1" start_page="73" end_page="74" type="sub_section"> <SectionTitle> 2.1 Chinese Whispers algorithm </SectionTitle> <Paragraph position="0"> CW is a very basic - yet effective - algorithm to partition the nodes of weighted, undirected graphs.</Paragraph> <Paragraph position="1"> It is motivated by the eponymous children's game, where children whisper words to each other. While the game's goal is to arrive at some funny derivative of the original message by passing it through several noisy channels, the CW algorithm aims at finding groups of nodes that broadcast the same message to their neighbors. It can be viewed as a simulation of an agent-based social network; for an overview of this field, see (Amblard 2002).</Paragraph> <Paragraph position="2"> The algorithm is outlined in figure 1: initialize: forall vi in V: class(vi)=i; while changes: forall v in V, randomized order: class(v)=highest ranked class in neighborhood of v; Intuitively, the algorithm works as follows in a bottom-up fashion: First, all nodes get different classes. Then the nodes are processed for a small number of iterations and inherit the strongest class in the local neighborhood. This is the class whose sum of edge weights to the current node is maximal. In case of multiple strongest classes, one is chosen randomly. Regions of the same class stabilize during the iteration and grow until they reach the border of a stable region of another class. 
Note that classes are updated immediately: a node can obtain classes from the neighborhood that were introduced there in the same iteration.</Paragraph> <Paragraph position="3"> Figure 2 illustrates how a small unweighted graph is clustered into two regions in three iterations. Different classes are symbolized by different shades of grey.</Paragraph> <Paragraph position="4"> It is possible to introduce a random mutation rate that assigns new classes with a probability decreasing in the number of iterations, as described in (Biemann & Teresniak 2005). This was shown to have positive effects for small graphs because of the slower convergence in early iterations.</Paragraph> <Paragraph position="5"> The CW algorithm cannot cross component boundaries, because there are no edges between nodes belonging to different components. Further, nodes that are not connected by any edge are discarded from the clustering process, which possibly leaves a portion of nodes unclustered. Formally, CW does not converge, as figure 3 exemplifies: here, the middle node's neighborhood results in a tie, which can be resolved by assigning either the class of the left nodes or the class of the right nodes, over and over again in any iteration. Ties, however, do not play a major role in weighted graphs.</Paragraph> <Paragraph position="6"> [Figure 3 caption, partially recovered: "... black class. Small numbers denote edge weights."] Apart from ties, the classes usually do not change any more after a handful of iterations. The number of iterations depends on the diameter of the graph: the larger the distance between two nodes is, the more iterations it takes to percolate information from one to the other.</Paragraph> <Paragraph position="7"> The result of CW is a hard partitioning of the given graph into a number of partitions that emerges in the process - CW is parameter-free. It is possible to obtain a soft partitioning by assigning a class distribution to each node in a final step, based on the weighted distribution of (hard) classes in its neighborhood.</Paragraph> <Paragraph position="8"> The outcomes of CW resemble those of Min-Cut (Wu & Leahy 1993): dense regions in the graph are grouped into one cluster while sparsely connected regions are separated. In contrast to Min-Cut, CW does not find an optimal hierarchical clustering but yields a non-hierarchical (flat) partition. Furthermore, it does not require any threshold as an input parameter and is more efficient. Another algorithm that uses only local contexts for time-linear clustering is DBSCAN, as described in (Ester et al. 1996), which needs two input parameters (although the authors propose an interactive approach to determine them). DBSCAN is especially suited for graphs with a geometrical interpretation, i.e. where the objects have coordinates in a multidimensional space. An algorithm quite similar to CW is MAJORCLUST (Stein & Niggemann 1996), which is based on a comparable idea but converges more slowly.</Paragraph> </Section> <Section position="2" start_page="74" end_page="75" type="sub_section"> <SectionTitle> 2.2 Chinese Whispers as matrix operation </SectionTitle> <Paragraph position="0"> As CW is a special case of Markov-Chain-Clustering (MCL) (van Dongen, 2000), we spend a few words on explaining it. MCL is the parallel simulation of all possible random walks up to a finite length on a graph G. The idea is that random walkers are more likely to end up in the cluster where they started than to walk across clusters.
MCL simulates flow on a graph by repeatedly updating transition probabilities between all nodes, eventually converging after k steps to a transition matrix that can be interpreted as a clustering of G. This is achieved by alternating an expansion step and an inflation step.</Paragraph> <Paragraph position="1"> The expansion step is a matrix multiplication of M_G with the current transition matrix. The inflation step is a column-wise non-linear operator that increases the contrast between small and large transition probabilities and normalizes the column-wise sums to 1. The k matrix multiplications of the expansion step of MCL lead to its time complexity of O(k·n²).</Paragraph> <Paragraph position="2"> It has been observed in (van Dongen, 2000) that only the first couple of iterations operate on dense matrices - when using a strong inflation operator, matrices in the later steps tend to be sparse. The author further discusses pruning schemes that keep only some of the largest entries per column, leading to drastic optimization possibilities. But the most aggressive sort of pruning is not considered: keeping only one single largest entry. Exactly this is done in the basic CW process. Let maxrow(·) be an operator that operates row-wise on a matrix and sets all entries of a row to zero except the largest entry, which is set to 1. Then the algorithm can be written as simply as this:</Paragraph> <Paragraph position="3"> D_0 = I_n;   D_t = maxrow(A_G · D_{t-1})</Paragraph> <Paragraph position="4"> where D_t is the class matrix at time step t, I_n is the identity matrix of size n×n, and A_G is the adjacency matrix of graph G.</Paragraph> <Paragraph position="5"> By applying maxrow(·), D_{t-1} has exactly n non-zero entries. This causes the time complexity to depend on the number of edges, namely O(k·|E|). In the worst case of a fully connected graph, this equals the time complexity of MCL.</Paragraph> <Paragraph position="6"> A problem with the matrix CW process is that it does not necessarily converge to an iteration-invariant class matrix D, but rather to a pair of oscillating class matrices. Figure 5 shows an example.</Paragraph> <Paragraph position="7"> This is caused by the stepwise update of the class matrix. As opposed to this, the CW algorithm as outlined in figure 1 continuously updates D after the processing of each node. To avoid these oscillations, one of the following measures can be taken:
* Random mutation: with some probability, the maxrow operator places the 1 for an otherwise unused class.
* Keep class: with some probability, the row is copied from D_{t-1} to D_t.
* Continuous update (equivalent to CW as described in section 2.1).
While converging to the same limits, the continuous update strategy converges the fastest because prominent classes are spread much faster in early iterations.</Paragraph>
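To make the matrix view concrete, a small numpy sketch of the stepwise maxrow update might look as follows (our own illustration under assumed names, not the paper's code; multiplying the adjacency matrix with the current class matrix lets each row collect the classes present in that node's neighborhood):

  import numpy as np

  def maxrow(M):
      """Row-wise operator: keep only the largest entry per row, set it to 1, zero the rest."""
      out = np.zeros_like(M, dtype=float)
      out[np.arange(M.shape[0]), M.argmax(axis=1)] = 1.0
      return out

  def chinese_whispers_matrix(A, steps=20):
      """Stepwise matrix CW: D_0 = I_n, D_t = maxrow(A_G . D_{t-1}).
      A: symmetric n x n weighted adjacency matrix (numpy array).
      Row i of the result has a single 1 in the column of node i's class."""
      D = np.eye(A.shape[0])
      for _ in range(steps):
          D = maxrow(A @ D)
      return D

As noted above, this stepwise variant may oscillate between two class matrices; the continuous-update algorithm of figure 1 avoids this.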
</Section> </Section> <Section position="5" start_page="75" end_page="76" type="metho"> <SectionTitle> 3 Experiments with synthetic graphs </SectionTitle> <Paragraph position="0"> The analysis of the CW process is difficult due to its nonlinear nature. Its run-time complexity indicates that it cannot directly optimize most global graph cluster measures, because computing them is NP-complete (Sima and Schaeffer, 2005).</Paragraph> <Paragraph position="1"> Therefore we perform experiments on synthetic graphs to empirically arrive at an impression of our algorithm's abilities. All experiments were conducted with an implementation following figure 1. For the experiments with synthetic graphs, we restrict ourselves to unweighted graphs, unless stated otherwise.</Paragraph> <Section position="1" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 3.1 Bi-partite cliques </SectionTitle> <Paragraph position="0"> A cluster algorithm should keep dense regions together while cutting apart regions that are sparsely connected. The highest density is reached in fully connected sub-graphs of n nodes, a.k.a. n-cliques. We define an n-bipartite-clique as a graph of two n-cliques, which are connected such that each node has exactly one edge going to the clique it does not belong to.</Paragraph> <Paragraph position="1"> Figures 5 and 6 show n-bipartite cliques for n=4 and n=10.</Paragraph> <Paragraph position="2"> We clearly expect a clustering algorithm to cut the two cliques apart. As we operate on unweighted graphs, however, CW is left with two choices: producing two clusters or grouping all nodes into one cluster. This is largely dependent on the random choices in very early iterations - if the same class is assigned to several nodes in both cliques, it will eventually cover the whole graph.</Paragraph> <Paragraph position="3"> [Figure 7 caption, partially recovered: "... when applying CW on n-bipartite cliques"] It is clearly a drawback that the outcome of CW is non-deterministic. Only half of the experiments with 4-bipartite cliques resulted in separation. However, the problem is most dramatic on small graphs and ceases to exist for larger graphs, as demonstrated in figure 7.</Paragraph> </Section> <Section position="2" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 3.2 Small world graphs </SectionTitle> <Paragraph position="0"> A structure that has been reported to occur in an enormous number of natural systems is the small world (SW) graph. Space prohibits an in-depth discussion, which can be found in (Watts 1999).</Paragraph> <Paragraph position="1"> Here, we restrict ourselves to SW-graphs in language data. In (Ferrer-i-Cancho and Sole, 2001), co-occurrence graphs as used in the experiment section are reported to possess the small world property, i.e. a high clustering coefficient and a short average path length between arbitrary nodes. Steyvers and Tenenbaum (2005) show that association networks as well as semantic resources are scale-free SW-graphs: their degree distribution follows a power law. A generative model is provided that generates undirected, scale-free SW-graphs in the following way: We start with a small number of fully connected nodes.</Paragraph> <Paragraph position="2"> When adding a new node, an existing node v is chosen with a probability according to its degree. The new node is connected to M nodes in the neighborhood of v. The generative model is parameterized by the number of nodes n and the network's mean connectivity, which approaches 2M for large n.</Paragraph> <Paragraph position="3"> Let us assume that we deal with natural systems that can be characterized by small world graphs. If two or more of those systems interfere, their graphs are joined by merging some nodes, retaining their edges. A graph-clustering algorithm should split up the resulting graph into its previous parts, at least if not too many nodes were merged. We conducted experiments to measure CW's performance on SW-graph mixtures: we generated graphs of various sizes, merged them pairwise to a varying extent, and measured the fraction of cases in which clustering with CW leads to the reconstruction of the original parts.
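A rough sketch of how such SW-graph mixtures can be generated, following the description of the Steyvers-Tenenbaum model given above (the function names, the inclusion of v itself among the attachment candidates, and the merging details are our own illustrative assumptions):

  import random

  def st_graph(n, M):
      """Scale-free small-world graph in the spirit of Steyvers & Tenenbaum (2005):
      start from M+1 fully connected nodes; every further node attaches to M nodes
      in the neighborhood of an existing node chosen proportionally to its degree."""
      graph = {i: set(range(M + 1)) - {i} for i in range(M + 1)}
      for new in range(M + 1, n):
          nodes = list(graph)
          degrees = [len(graph[v]) for v in nodes]
          v = random.choices(nodes, weights=degrees)[0]
          candidates = sorted(graph[v] | {v})        # v itself is included here (assumption)
          graph[new] = set()
          for target in random.sample(candidates, min(M, len(candidates))):
              graph[new].add(target)
              graph[target].add(new)
      return graph

  def merge_graphs(g1, g2, merge_rate):
      """Join two graphs: a fraction merge_rate of g2's nodes is identified with
      randomly chosen nodes of g1; all edges are retained (self-loops arising
      from merging are dropped)."""
      offset = max(g1) + 1
      shared = set(random.sample(list(g2), int(merge_rate * len(g2))))
      mapping = {v: (random.choice(list(g1)) if v in shared else v + offset) for v in g2}
      merged = {v: set(nbrs) for v, nbrs in g1.items()}
      for v, nbrs in g2.items():
          for u in nbrs:
              a, b = mapping[v], mapping[u]
              if a != b:
                  merged.setdefault(a, set()).add(b)
                  merged.setdefault(b, set()).add(a)
      return merged

To run the CW sketch from section 2.1 on such a mixture, each neighbor set can be converted to weight-1 edges, e.g. {v: {u: 1.0 for u in nbrs} for v, nbrs in merged.items()}.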
When generating SW-graphs with the Steyvers-Tenenbaum model, we fixed M to 10 and varied n and the merge rate r, which is the fraction of nodes of the smaller graph that is merged with nodes of the larger graph.</Paragraph> <Paragraph position="4"> [Figure caption, partially recovered: "... mixtures of 300, 3,000 and 30,000 nodes and mixtures of 300 with 30,000 nodes."]</Paragraph> <Paragraph position="5"> It is not surprising that separating the two parts is more difficult for higher r. Results are not very sensitive to size and size ratio, indicating that CW is able to identify clusters even if they differ considerably in size - it even performs best on the skewed mixtures. At merge rates between 20% and 30%, still more than half of the mixtures are separated correctly, and the parts can be found when averaging CW's outcome over several runs.</Paragraph> </Section> <Section position="3" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 3.3 Speed issues </SectionTitle> <Paragraph position="0"> As the algorithm, formally, does not converge, it is important to define a stop criterion or to set the number of iterations. To show that only a few iterations are needed until almost-convergence, we measured the normalized Mutual Information (MI, defined for two random variables X and Y as (H(X)+H(Y)-H(X,Y))/max(H(X),H(Y)), with H(X) the entropy; a value of 0 denotes independence, 1 perfect congruence) between the clustering in the 50th iteration and the clusterings of earlier iterations. This was conducted for two unweighted SW-graphs with 1,000 (1K) and 10,000 (10K) nodes, M=5, and a weighted 7-lingual co-occurrence graph (cf. section 4.1) with 22,805 nodes and 232,875 edges.</Paragraph> <Paragraph position="1"> Table 1 indicates that for unweighted graphs, changes are only small after 20-30 iterations. In iterations 40-50, the normalized MI values do not improve any more. The weighted graph converges much faster due to fewer ties and reaches a stable plateau after only 6 iterations.</Paragraph> </Section> </Section> <Section position="6" start_page="76" end_page="77" type="metho"> <SectionTitle> 4 NLP Experiments </SectionTitle> <Paragraph position="0"> In this section, some experiments with graphs originating from natural language data are presented. First, we define the notion of co-occurrence graphs, which are used in sections 4.1 and 4.3: Two words co-occur if they can both be found in a certain unit of text, here a sentence.</Paragraph> <Paragraph position="1"> Employing a significance measure, we determine whether their co-occurrence is significant or random. In this case, we use the log-likelihood measure as described in (Dunning 1993). We use the words as nodes in the graph. The weight of an edge between two words is set to the significance value of their co-occurrence, if it exceeds a certain threshold. In the experiments, we used significance values of 15 and above. The entirety of words that are involved in at least one edge, together with these edges, is called the co-occurrence graph (cf. Biemann et al. 2004).</Paragraph> <Paragraph position="2"> In general, CW produces a large number of clusters on real-world graphs, of which the majority are very small. For most applications, it might be advisable to define a minimum cluster size or a similar criterion.</Paragraph>
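To make the setting of this section concrete, the following sketch builds a sentence-based co-occurrence graph with log-likelihood weights, clusters it with the chinese_whispers sketch given earlier (section 2.1), and discards very small clusters (the contingency-table formulation of the log-likelihood measure, all helper names, and the default thresholds are our own illustrative choices; Dunning (1993) gives the original derivation):

  import math
  from collections import Counter
  from itertools import combinations

  def log_likelihood(k11, k12, k21, k22):
      """Dunning-style log-likelihood ratio computed from a 2x2 contingency table."""
      def term(k, e):
          return k * math.log(k / e) if k > 0 else 0.0
      total = k11 + k12 + k21 + k22
      row1, row2 = k11 + k12, k21 + k22
      col1, col2 = k11 + k21, k12 + k22
      return 2.0 * (term(k11, row1 * col1 / total) + term(k12, row1 * col2 / total)
                    + term(k21, row2 * col1 / total) + term(k22, row2 * col2 / total))

  def cooccurrence_graph(sentences, threshold=15.0):
      """sentences: iterable of token lists. Builds a word graph whose edges are
      sentence co-occurrences with a significance value of at least `threshold`."""
      word_freq, pair_freq = Counter(), Counter()
      n = 0
      for sent in sentences:
          n += 1
          words = set(sent)
          word_freq.update(words)
          pair_freq.update(combinations(sorted(words), 2))
      graph = {}
      for (w1, w2), k11 in pair_freq.items():
          k12 = word_freq[w1] - k11                  # sentences with w1 but not w2
          k21 = word_freq[w2] - k11                  # sentences with w2 but not w1
          k22 = n - word_freq[w1] - word_freq[w2] + k11
          sig = log_likelihood(k11, k12, k21, k22)
          if sig >= threshold:
              graph.setdefault(w1, {})[w2] = sig
              graph.setdefault(w2, {})[w1] = sig
      return graph

  def cluster_words(sentences, threshold=15.0, iterations=20, min_cluster_size=5):
      """Co-occurrence graph -> CW clusters, discarding very small clusters."""
      graph = cooccurrence_graph(sentences, threshold)
      labels = chinese_whispers(graph, iterations)   # CW sketch from section 2.1, assumed in scope
      clusters = {}
      for word, cls in labels.items():
          clusters.setdefault(cls, []).append(word)
      return {c: ws for c, ws in clusters.items() if len(ws) >= min_cluster_size}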
<Section position="1" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 4.1 Language Separation </SectionTitle> <Paragraph position="0"> This section briefly reviews the results of (Biemann and Teresniak, 2005), where CW was first described. The task was to separate a multilingual corpus by language, assuming it is tokenized into sentences.</Paragraph> <Paragraph position="1"> The co-occurrence graph of a multilingual corpus resembles the synthetic SW-graphs: every language forms a separate co-occurrence graph, and words that are used in more than one language are members of several graphs, connecting them. By CW-partitioning, the graph is split into its monolingual parts. These parts are used as word lists for word-based language identification. (Biemann and Teresniak, 2005) report almost perfect performance on sorting apart 7-lingual corpora with equally sized parts as well as highly skewed mixtures of two languages.</Paragraph> <Paragraph position="2"> In the process, language-ambiguous words are assigned to only one language, which did not hurt performance due to the high redundancy of the task. However, it would have been possible to use the soft partitioning to acquire a distribution over languages for each word.</Paragraph> </Section> <Section position="2" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 4.2 Acquisition of Word Classes </SectionTitle> <Paragraph position="0"> For the acquisition of word classes, we use a different graph: the second-order graph on neighboring co-occurrences. To set up the graph, a co-occurrence calculation is performed which yields significant word pairs based on their occurrence as immediate neighbors. This can be perceived as a bipartite graph; figure 9a gives a toy example. Note that if similar words occur in both parts, they form two distinct nodes.</Paragraph> <Paragraph position="1"> This graph is transformed into a second-order graph by comparing the number of common right and left neighbors for two words. The similarity (edge weight) between two words is the number of common neighbors. Figure 9b depicts the second-order graph derived from figure 9a and its partitioning by CW. The word-class-ambiguous word "drink" (to drink the drink) is responsible for all inter-cluster edges. The hypothesis here is that words sharing many neighbors should usually be observed with the same part-of-speech and should receive high weights in the second-order graph. In figure 9, three clusters are obtained that correspond to different parts-of-speech (POS).</Paragraph> <Paragraph position="2"> To test this on a large scale, we computed the second-order similarity graph for the British National Corpus (BNC), excluding the most frequent 2000 words and drawing edges between words if they shared at least four left and right neighbors. The clusters are checked against a lexicon that contains the most frequent tag for each word in the BNC. The largest clusters are presented in table 2.</Paragraph> <Paragraph position="3"> Table 2 (excerpt), with columns size | tags:count | sample words: 18432 | NN:17120, AJ:631 | secret, officials, transport, unemployment, farm, county, wood, procedure, grounds, ...</Paragraph> <Paragraph position="4"> [Figure 9 caption, partially recovered: "... second order graph with CW."]</Paragraph>
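For concreteness, the second-order graph construction described above might be sketched as follows (a naive quadratic comparison with illustrative names; reading "at least four left and right neighbors" as at least four on each side is our assumption):

  from collections import defaultdict

  def second_order_graph(neighbor_pairs, min_shared=4):
      """neighbor_pairs: iterable of (left_word, right_word) pairs, i.e. the
      significant neighboring co-occurrences of the bipartite graph (figure 9a).
      Two words are connected if they share at least `min_shared` left neighbors
      and `min_shared` right neighbors; the weight is the number of shared neighbors."""
      left_ctx, right_ctx = defaultdict(set), defaultdict(set)
      for l, r in neighbor_pairs:
          right_ctx[l].add(r)     # r occurs to the right of l
          left_ctx[r].add(l)      # l occurs to the left of r
      words = sorted(set(left_ctx) | set(right_ctx))
      graph = {}
      for i, w1 in enumerate(words):            # naive O(n^2) comparison
          for w2 in words[i + 1:]:
              shared_left = len(left_ctx[w1] & left_ctx[w2])
              shared_right = len(right_ctx[w1] & right_ctx[w2])
              if shared_left >= min_shared and shared_right >= min_shared:
                  weight = shared_left + shared_right
                  graph.setdefault(w1, {})[w2] = weight
                  graph.setdefault(w2, {})[w1] = weight
      return graph

The resulting graph can then be clustered with the chinese_whispers sketch from section 2.1.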
<Paragraph position="5"> In total, CW produced 282 clusters, of which 26 exceed a size of 100. The weighted average of cluster purity (i.e. the count of the predominant tag divided by the cluster size) was measured at 88.8%, which significantly exceeds the precision of 53% on word types reported by Schütze (1995) for a related task. How to use this kind of word clusters to improve the accuracy of POS-taggers is outlined in (Ushioda, 1996).</Paragraph> </Section> <Section position="3" start_page="77" end_page="77" type="sub_section"> <SectionTitle> 4.3 Word Sense Induction </SectionTitle> <Paragraph position="0"> The task of word sense induction (WSI) is to find the different senses of a word. The number of senses is not known in advance and therefore has to be determined by the method.</Paragraph> <Paragraph position="1"> Similar to the approach presented in (Dorow and Widdows, 2003), we construct a word graph. While there edges between words are drawn iff the words co-occur in enumerations, we use the co-occurrence graph. Dorow and Widdows construct a graph for a target word w by taking the sub-graph induced by the neighborhood of w (without w) and clustering it with MCL. We replace MCL by CW. The clusters are interpreted as representations of word senses.</Paragraph> <Paragraph position="2"> To judge results, the methodology of (Bordag, 2006) is adopted: to evaluate word sense induction, two sub-graphs induced by the neighborhoods of different words are merged. The algorithm's ability to separate the merged graph into its previous parts can be measured in an unsupervised way. Bordag defines four measures:
* retrieval precision (rP): similarity of the found sense with the gold-standard sense
* retrieval recall (rR): amount of words that have been correctly assigned to the gold-standard sense
* precision (P): fraction of correctly found disambiguations
* recall (R): fraction of correctly found senses</Paragraph> <Paragraph position="3"> We used the same program to compute co-occurrences on the same corpus (the BNC). Therefore it is possible to directly compare our results to Bordag's, who uses a triplet-based hierarchical graph clustering approach. The method was chosen because of its appropriateness for unlabelled data: without linguistic pre-processing like tagging or parsing, only the disambiguation mechanism is measured and not the quality of the preprocessing steps. We provide scores for his test 1 (word classes separately) and test 3 (words of different frequency bands). Data was obtained from BNC's raw text; evaluation was broken down by word class and frequency band. Results (tables 3 and 4) suggest that both algorithms arrive at about equal overall performance (P and R). Chinese Whispers clustering is able to capture the same information as a specialized graph-clustering algorithm for WSI, given the same input. The slightly superior performance on rR and rP indicates that CW leaves fewer words unclustered, which can be advantageous when using the clusters as clues in word sense disambiguation.</Paragraph> </Section> </Section> </Paper>