File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/w01-0909_metho.xml
Size: 21,333 bytes
Last Modified: 2025-10-06 14:07:43
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0909"> <Title>A cross-comparison of two clustering methods</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Semantic domain learning </SectionTitle> <Paragraph position="0"> This description aims at showing the data used for learning, and the specificity of the learned classes.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2.1 The thematic segmentation: SEGCOHLEX </SectionTitle> <Paragraph position="0"> Studied texts are newspaper articles coming from two corpora: &quot;Le Monde&quot; and &quot;AFP&quot;. Some of these texts have been used to build a lexical network where links between two words represent an evaluation of their mutual information to capture semantic and pragmatic relations between them, computed from their co-occurrence count. In order to build class of words linked to a same topic, we first realize a topic segmentation of the texts in thematic units (TU) whose words refer to the same topic, and learning is applied on these thematic units.</Paragraph> <Paragraph position="1"> Text segmentation is based on the use of the collocation network. A topic is detected by computing a cohesion value for each word resulting from the relations found in the network between these words and their neighbors in a text. As in Kozima's work (Kozima, 1993), this computation operates on words belonging to a focus window that is moved all over the text.The cohesion values lead to build a graph and by successive transformations applied to it, texts are automatically divided in discourse segments. Such a method leads to delimit small segments, whose size is equivalent to a paragraph, i. e. capable of retrieving topic variations in short texts, as newswires for example. Table 1 shows an extract of the words belonging to a cohesive segment about a dedication of a book.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Semantic Domain learning in </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> SEGAPSITH </SectionTitle> <Paragraph position="0"> Learning a semantic domain consists of aggregating all the most cohesive thematic units, TUs, that are related to a same subject, i. e. a same kind of situation. We only retain segments whose cohesion value is higher than a threshold, in order to ground our learning on the more reliable units. Similarity between a thematic unit and a semantic domain is evaluated from their common words. When the similarity value exceeds a threshold, the thematic unit is aggregated to the semantic domain, otherwise a new domain is created. Each aggregation of a new thematic unit increases the system's knowledge about one topic by reinforcing recurrent words and adding new ones. Weights on words represent their importance relative to the topic and are computed from the number of occurrences of these words in the TUs.</Paragraph> <Paragraph position="1"> Units related to a same topic are found in different texts and often develop different points of view of a same type of subject. To ensure a better similarity between them, SEGAPSITH enriches a particular description given by a text segment by adding to these units those words of the collocation network that are particularly linked to the words found in the segment. Table 2 gives an extract of the words added to the segment of Table This method leads SEGAPSITH to learn specific topic representations (see Table 3) as opposed to (Lin, 1997) for example, whose method builds general topic descriptions as for economy, sport, etc. Moreover, it does not depend on any a priori classification of the texts.</Paragraph> <Paragraph position="2"> We applied the learning module of SEGAPSITH on one month (May 1994) of AFP newswires, corresponding to 7823 TUs. In our experiments (Ferret and Grau, 1998a), (Ferret and Grau, 1998b), we showed that domains reach a stability at 15 to 20 aggregations, and that words having a weight below 0.1 are rarely related to the domain. Thus, we only selected domains resulting from at least 15 aggregations for our crosscomparison, i.e. 71 domains regrouping 4935 TUs and 4380 words having a weight upon 0.1.</Paragraph> <Paragraph position="3"> A lot of domains share common words, and are close enough to be considered as different representations of specific points of view of a general topic, as economy, sport, etc.</Paragraph> </Section> <Section position="6" start_page="0" end_page="71" type="metho"> <SectionTitle> 3 Entropy-based clustering </SectionTitle> <Paragraph position="0"> The second clustering method gives an optimal partition of the 4935 thematic units in 71 non-words weight words weight strider 0.683 entourer (to surround) 0.368 toward 0.683 signature (signature) 0.366 d'edicacer (to dedicate) 0.522 exemplaire (exemplar) 0.357 apposer (to append) 0.467 page (page) 0.332 pointu (sharp-pointed) 0.454 train (train) 0.331 relater (to relate) 0.445 centaine (hundred) 0.330 boycottage (boycotting) 0.436 sentir (to feel) 0.328 autobus (bus) 0.435 livre (book) 0.289 enfoncer (to drive in) 0.410 personne (person) 0.267 inferred words weight inferred words weight paraphe (paraph) 0.522 imprimerie (press) 0.418 presse parisien (parisian-press) 0.480 'editer (to publish) 0.407 best seller (best seller) 0.477 biographie (biography) 0.406 maison d''edition (publishing house) 0.450 librairie (bookshop) 0.405 words occurrences weight juge d'instruction (examining judge) 58 0.501 garde `a vue (police custody) 50 0.442 bien social (public property) 46 0.428 inculpation (charging) 49 0.421 'ecrouer (to imprison) 45 0.417 chambre d'accusation (court of criminal appeal) 47 0.412 recel (receiving stolen goods) 42 0.397 pr'esumer (to presume) 45 0.382 police judiciaire (criminal investigation department) 42 0.381 escroquerie (fraud) 42 0.381 overlapping clusters according to the word distributions in the units. It is realized with an algorithm which looks like K-means (here K=71). Each cluster is the merge of several thematic units and is represented by its centroid. We search for the partition which minimizes the Kullback-Leibleir divergence (Cover and Thomas, 1991) between the word distributions of the thematic units and those of of their centroids. This entropy-based measure is convex (Jardino, 2000), this propriety permits to get an optimal partition whatever the initial conditions.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Entropy </SectionTitle> <Paragraph position="0"> We assume that each thematic unit is represented by one quantitative vector whose components are the relative occurrences of a selection of words related to the unit. The advantages of this normalization is that the representation of the thematic units does not depend on the length of the units and can be modelized in the frame of the information theory (Cover and Thomas, 1991).</Paragraph> <Paragraph position="1"> Assuming that O w;;tu is the occurrence of the word labelled w in the thematic unit labelled tu and that O tu is the occurrence of all the words in the thematic unit tu, such that O</Paragraph> <Paragraph position="3"> When the thematic units are unclassed, their entropy is given by (Cover and Thomas, 1991):</Paragraph> <Paragraph position="5"> When the thematic units are gathered in K clusters, labelled k, the cluster entropy is, H</Paragraph> <Paragraph position="7"> The cluster entropy is always higher than or equal to the unit entropy (log-sum rule (Cover and Thomas, 1991)), so that the Kullback-Leibleir divergence defined as:</Paragraph> <Paragraph position="9"> is always higher than or equal to 0.</Paragraph> </Section> <Section position="2" start_page="0" end_page="71" type="sub_section"> <SectionTitle> 3.2 Clustering algorithm Minimizing the Kullback-Leibler divergence </SectionTitle> <Paragraph position="0"> amounts to minimize the entropy H</Paragraph> <Paragraph position="2"> does not depend on the clusters.</Paragraph> <Paragraph position="3"> The number of possible partitions is huge, roughly 4935 . We have observed that a random search is faster than a systematic one (Jardino, 2000), and we have used this paradigm to build the algorithm described below: 1- Define a priori, K, the cluster number, here K=71.</Paragraph> <Paragraph position="4"> 2- Initialize: put all the thematic units in one cluster, calculate the entropy H K (equation 3). The remaining K-1 clusters are empty. 3- Do the random selection of one thematic unit and of another cluster for this unit. 4- Move the unit from its former cluster to the new randomly selected one, calculate the new entropy. 5- If the new entropy is lower, leave the unit in its new cluster, otherwise move it back to its initial cluster.</Paragraph> <Paragraph position="5"> 6- Repeat 3 to 5 until there is no more change. The optimal clustering of the 4935 thematic units in 71 clusters is performed on a workstation (SGI Indy) within twenty minutes.</Paragraph> </Section> </Section> <Section position="7" start_page="71" end_page="71" type="metho"> <SectionTitle> 4 Comparing two classifications </SectionTitle> <Paragraph position="0"> We established different criteria for comparing the two classifications, based on the elements used to describe the classes. First, each class is a set of words with an occurrence number for each of them; second it is also a set of thematic units.</Paragraph> <Paragraph position="1"> So, the comparison can be done along these two points of view.</Paragraph> <Paragraph position="2"> In order to evaluate the overlapping of the classes of words, we applied each classification method on the two classification results: the classes of words resulting from the second method are classified relative to the semantic domains. For comparing the classes of TUs, we applied the entropy measure on one hand to measure the overlapping of the classes, and and Mantel tests on the other hand to evaluate the differences in the repartition of all the TUs.</Paragraph> <Section position="1" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 4.1 The word point of view </SectionTitle> <Paragraph position="0"> The classification of the clusters relative to the semantic domains exploits the same similarity measure than the one used for the learning phase.</Paragraph> <Paragraph position="1"> In a first step, some domains are selected according to the value of the activ function:</Paragraph> <Paragraph position="3"> is the weight of the word i in the semantic domain d and W c;;i is the weight of the same word in the cluster c. This first step was used in the learning phase because the number of semantic domains was increasing rapidly and this measure leads to a first fast selection of interesting domains before evaluating an in-depth similarity. We kept this step, even if it was not necessary, in order to apply exactly the same method in the evaluation phase. Afterwards, each selected domain can be compared to the cluster by using the similarity measure given below. If one of these similarity values is greater than a given threshold, fixed to 0.25 in our tests, the cluster is linked to the domain that is the most similar to it. The similarity measure is:</Paragraph> <Paragraph position="5"> where the w index is used for indicating common words between the cluster c and the semantic domain d and the t index, for indicating all the words of the cluster or the domain. W is the weight of a word and O its occurrence number.</Paragraph> <Paragraph position="6"> The similarity measure is only based on the common words. As learning is unsupervised and incremental, differences at time t might disappear at time t+1. The comparison is based on the proportion of common words relative to the total of words of each entity to be compared. The evaluation of the common words in each entity is done according to their occurrence number and their weight. So, we avoid to obtain a high similarity value between two entities that only share very few words having a high weight. We combine these criteria in a geometrical mean for evaluating each entity and for computing the global similarity from the evaluation of the two entities in order to smooth the effect of few recurrent words when the domains are in their formation phase, words that would act as attractors otherwise.</Paragraph> <Paragraph position="7"> For each of the 71 clusters, we have searched for the nearest domain obtained with the same kind of entropy-based measure defined above. We assume that we have a probabilistic model which gives the predictions of the words according to the domains. In order to avoid the null value, nonlearned events are infered using the Witten-Bell interpolation (Witten and Bell, 1991). The interpolated value of the prediction of a word w, knowing the domain d is p domain and V the size of the vocabulary. Each cluster is also defined by a set of words and we compare the distribution of the words in the cluster with the distributions of the words in the domains (equation 8) with the Kullback-Leibler divergence. Each cluster is associated with the nearest domain.</Paragraph> <Paragraph position="8"> The results of the two classifications described above are given in Table 4. For the similarity-based classification, only 3 clusters do not match with any domain and 47 different domains are selected for the 68 remaining clusters with 34 links that are one cluster-one domain. For the entropy-based classification, 44 clusters have been associated to the 71 domains.</Paragraph> <Paragraph position="9"> Several clusters are linked to the same domain.</Paragraph> <Paragraph position="10"> This can been explained by the closeness of some of the domains. This is shown when they are hierarchically classified; we obtain then 34 general domains that regroup 1 to 5 domains each. We also observe that most clusters are only linked to one domain. The two methods give almost the same results and show that the two classifications are similar.</Paragraph> </Section> <Section position="2" start_page="71" end_page="71" type="sub_section"> <SectionTitle> and the clusters 4.2 The TU point of view 4.2.1 A simple comparison </SectionTitle> <Paragraph position="0"> One domain and one cluster are associated to each thematic unit. It is then possible to calculate the number of domains and clusters which partially or fully overlap. Table 5 represents the intersection between the two partitions. For each domain we chose the cluster which has the highest intersection with the domain. Then we calculated the percentage of thematic units of this domain which are both in the domain and in this chosen cluster.</Paragraph> <Paragraph position="1"> coverage number of clusters</Paragraph> <Paragraph position="3"> to each domain and those of the associated clusters which correspond to the highest intersection Height domains are identical to height clusters.</Paragraph> <Paragraph position="4"> The lowest coverage (18%) is obtained for one domain. The other seventy domains cover more than 20% of the clusters.</Paragraph> <Paragraph position="5"> In order to compare more precisely our two classifications, we used the coefficient as it was done by Dietterich in (Dietterich, 2000) and as it is often done in the field of remote sensing for example. The coefficient measures the degree of agreement among several judgements and is expressed as follows: is the proportion of times that we could expect the judgments to agree by chance. As we are in a case of unsupervised classification whereas Dietterich's work was about supervised classification (building of decision trees), we have first set a one-to-one mapping between the semantic domains and the clusters. This was done by a very simple procedure: we computed the size of the intersection between each cluster and each domain; then we iteratively mapped the cluster and the domain that had the largest intersection until each cluster was mapped with a domain. Of course, this is not an optimal procedure in order to ensure that the intersection of each couple of classes is the largest one but it can be considered as a baseline.</Paragraph> <Paragraph position="6"> Then, the evaluation of the coefficient was done by building a matrix K K, with K, the number of classes (clusters or domains), such that each element k i;;j is equal to the number of TUs assigned to the class i by SEGAPSITH and to the class j by the entropy-based clustering.</Paragraph> <Paragraph position="7"> , which estimates the probability that the two classifications agree, is defined by:</Paragraph> <Paragraph position="9"> where N is the total number of TUs. It evaluates the proportion of TUs that were put in the same classes by the two clustering algorithms.</Paragraph> <Paragraph position="10"> , which estimates the probability that the two algorithms agree by chance, is given by:</Paragraph> <Paragraph position="12"> are the marginal distributions. null The coefficient that results from the evalu- null is equal to 0 when the two clustering algorithms agree only by chance and to 1 when they really agree for each TU. Negative values occur when there is a systematic disagreement. null For the 71 classes of our test set, we computed the coefficient in two cases. First, with a random mapping of the clusters and domains. We got K = -0.013, which is very close to the agreement by chance. Second, we applied the above mapping procedure and got K = 0.484, which indicates a significant correlation between the two classifications. We think that with a more complex mapping procedure, the would be higher. 4.2.3 Application of the Mantel Test In this paradigm, each classified thematic unit, tu, is described according to its position in the classification in relation to all the classified elements. This position is characterized by a distance between tu and each other element. In the work we present here, we choose a simple dis- null However more complex distances may be used when the classifications are hierarchical ones for example. After this first step, each tu i of the two classifications to compare is characterized by a vector, each element of which, d ij , is equal to the distance between tu i and tu j . Hence, each classification is characterized by a distance matrix, which is a square symmetric matrix of size . Comparing the two classifications amounts to compare their distance matrices. In the ecology field, such kind of comparison is achieved by a statistical test, called the Mantel test (Mantel, 1967). In (Legendre, 2000), Legendre defines the Mantel test as &quot; a procedure to test the hypothesis that the distances among objects in a matrix A are linearly independent of the distances among the same objects in another matrix B. The result of this test may be used as support for or against the hypothesis that the process that generated the first set of distances is independent of the process that generated the second set. The unique feature of the Mantel test is the use of a linear statistic to assess the relationship between two distance matrices&quot;. The basic statistic used in the Mantel test is the Z statistic:</Paragraph> <Paragraph position="14"> As the elements of a distance matrix are not independent, the significance of ZS, the Z statistic that is computed for the two distance matrices to compare, is evaluated by comparing this value to the Z statistic that is computed for matrices whose rows and columns are randomly permuted. A distribution of random values is obtained by computing the statistic for many permuted matrices and if ZS is significantly above this distribution, the hypothesis that the two matrices are independent are equal to 1. Hence, each difference that could be introduced between the two matrices, decreases its value. As an exploratory step, we applied the Mantel test in order to compare the results of the two classification methods we presented in this article. We used the software developed by Adam Liedloff (Liedloff, 1999). As the number of TUs is too large in comparison with the capabilities of this software, we experimented the Mantel test only on a subset of 1000 TUs. With the distance matrix computed from the results of the two classification methods, we got a Z statistic (ZS) equal to 940;;894. The maximum value of ZS is 978;;460 for the domains and 948;;608 for the clusters. The random distribution was built from 99 permuted matrices and its ZS value is 937;;708 232. As the proportion of the values from the random distribution that are above ZS is equal to zero, we can reject the hypothesis that the two matrices are independent and as a consequence, we can think that the two compared classifications are globally similar. However, as the results of the Mantel test are not easy to interpret, further tests must be performed to see what are the relations between these results and those of the other comparing methods and to determine if this test is actually suited for comparing such kind of classifications.</Paragraph> </Section> </Section> class="xml-element"></Paper>