<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1144"> <Title>Concept Discovery from Text</Title> <Section position="4" start_page="2" end_page="22" type="metho"> <SectionTitle> 3 Word Similarity </SectionTitle> <Paragraph position="0"> Following (Lin 1998), we represent each word by a feature vector. Each feature corresponds to a context in which the word occurs. For example, threaten with __ is a context. If the word handgun occurred in this context, the context is a feature of handgun. The value of the feature is the pointwise mutual information (Manning and Schtze 1999 p.178) between the feature and the word. Let c be a context and</Paragraph> <Paragraph position="2"> (w) be the frequency count of a word w occurring in context c. The pointwise mutual information between c and w is defined as:</Paragraph> <Paragraph position="4"> jF is the total frequency counts of all words and their contexts. A well-known problem with mutual information is that it is biased towards infrequent words/features. We therefore multiplied mi w,c with a discounting factor:</Paragraph> <Paragraph position="6"> We compute the similarity between two words</Paragraph> <Paragraph position="8"> using the cosine coefficient (Salton and McGill 1983) of their mutual information vectors:</Paragraph> <Paragraph position="10"/> </Section> <Section position="5" start_page="22" end_page="22" type="metho"> <SectionTitle> 4 CBC Algorithm </SectionTitle> <Paragraph position="0"> CBC consists of three phases. In Phase I, we compute each elements top-k similar elements.</Paragraph> <Paragraph position="1"> In our experiments, we used k = 20. In Phase II, we construct a collection of tight clusters, where the elements of each cluster form a committee.</Paragraph> <Paragraph position="2"> The algorithm tries to form as many committees as possible on the condition that each newly formed committee is not very similar to any existing committee. If the condition is violated, the committee is simply discarded. In the final phase of the algorithm, each element is assigned to its most similar cluster.</Paragraph> <Paragraph position="3"> 4.1. Phase I: Find top-similar elements Computing the complete similarity matrix between pairs of elements is obviously quadratic. However, one can dramatically reduce the running time by taking advantage of the fact that the feature vector is sparse. By indexing the features, one can retrieve the set of elements that have a given feature. To compute the top similar words of a word w, we first sort w's features according to their mutual information with w.</Paragraph> <Paragraph position="4"> We only compute pairwise similarities between w and the words that share a high mutual information feature with w.</Paragraph> <Paragraph position="5"> 4.2. Phase II: Find committees The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters, called committees, and identifies residue elements that are not covered by any committee. We say a committee covers an element if the elements similarity to the centroid of the committee exceeds some high similarity threshold. The algorithm then recursively attempts to find more committees among the residue elements. The output of the algorithm is the union of all committees found in each recursive step. 
<Section position="5" start_page="22" end_page="22" type="metho"> <SectionTitle> 4 CBC Algorithm </SectionTitle>
<Paragraph position="0"> CBC consists of three phases. In Phase I, we compute each element's top-k similar elements. In our experiments, we used k = 20. In Phase II, we construct a collection of tight clusters, where the elements of each cluster form a committee. The algorithm tries to form as many committees as possible on the condition that each newly formed committee is not very similar to any existing committee. If the condition is violated, the committee is simply discarded. In the final phase of the algorithm, each element is assigned to its most similar cluster.</Paragraph>
<Paragraph position="1"> 4.1. Phase I: Find top-similar elements
Computing the complete similarity matrix between pairs of elements is obviously quadratic. However, one can dramatically reduce the running time by taking advantage of the fact that the feature vector is sparse. By indexing the features, one can retrieve the set of elements that have a given feature. To compute the top similar words of a word w, we first sort w's features according to their mutual information with w. We only compute pairwise similarities between w and the words that share a high mutual information feature with w.</Paragraph>
<Paragraph position="2"> 4.2. Phase II: Find committees
The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters, called committees, and identifies residue elements that are not covered by any committee. We say a committee covers an element if the element's similarity to the centroid of the committee exceeds some high similarity threshold. The algorithm then recursively attempts to find more committees among the residue elements. The output of the algorithm is the union of all committees found in each recursive step. The details of Phase II are presented in Figure 1. In Step 1, the score reflects a preference for bigger and tighter clusters. Step 2 gives preference to higher quality clusters in Step 3, where a cluster is only kept if its similarity to all previously kept clusters is below a fixed threshold. Step 4 terminates the recursion if no committee is found in the previous step. The residue elements are identified in Step 5 and, if no residues are found, the algorithm terminates; otherwise, we recursively apply the algorithm to the residue elements. Each committee that is discovered in this phase defines one of the final output clusters of the algorithm.</Paragraph>
<Paragraph position="3"> Figure 1. Finding committees (Phase II):
Step 1: For each element e ∈ E, cluster the top similar elements of e from S using average-link clustering. For each cluster discovered c, compute the following score: |c| × avgsim(c), where |c| is the number of elements in c and avgsim(c) is the average pairwise similarity between elements in c. Store the highest-scoring cluster in a list L.
Step 2: Sort the clusters in L in descending order of their scores.
Step 3: Let C be a list of committees, initially empty. For each cluster c ∈ L in sorted order, compute the centroid of c by averaging the frequency vectors of its elements and computing the mutual information vector of the centroid in the same way as we did for individual elements. If c's similarity to the centroid of each committee previously added to C is below a threshold, add c to C.</Paragraph>
<Paragraph position="4"> 4.3. Phase III: Assign elements to clusters
In Phase III, every element is assigned to the cluster containing the committee to which it is most similar. This phase resembles K-means in that every element is assigned to its closest centroid. Unlike K-means, the number of clusters is not fixed and the centroids do not change (i.e. when an element is added to a cluster, it is not added to the committee of the cluster).</Paragraph> </Section>
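The following Python sketch gives one possible reading of Phases II and III. The helper names, the greedy agglomeration standing in for the average-link step, and the numeric thresholds are illustrative assumptions of this sketch, not values taken from the paper.

```python
# Hypothetical sketch of CBC Phases II and III over sparse feature vectors.
import itertools, math

def cosine(v1, v2):
    num = sum(v1[c] * v2.get(c, 0.0) for c in v1)
    den = math.sqrt(sum(x * x for x in v1.values()) * sum(x * x for x in v2.values()))
    return num / den if den else 0.0

def centroid(cluster, vectors):
    cen = {}
    for e in cluster:
        for c, v in vectors[e].items():
            cen[c] = cen.get(c, 0.0) + v / len(cluster)
    return cen

def avgsim(cluster, vectors):
    pairs = list(itertools.combinations(cluster, 2))
    return (sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
            if pairs else 1.0)

def average_link(elements, vectors, stop=0.4):
    """Simplified average-link agglomeration: merge while clusters stay tight."""
    clusters = [[e] for e in elements]
    while len(clusters) > 1:
        i, j = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: avgsim(clusters[p[0]] + clusters[p[1]], vectors))
        if avgsim(clusters[i] + clusters[j], vectors) < stop:
            break
        clusters[i] += clusters.pop(j)
    return clusters

def find_committees(elements, top_similar, vectors, theta1=0.35, theta2=0.25):
    """Phase II: recursively discover committees among the elements."""
    best = []
    for e in elements:  # Step 1: cluster each element's top similar elements
        cands = average_link([x for x in top_similar[e] if x in vectors], vectors)
        if cands:
            best.append(max(cands, key=lambda c: len(c) * avgsim(c, vectors)))
    best.sort(key=lambda c: len(c) * avgsim(c, vectors), reverse=True)  # Step 2
    committees = []
    for c in best:  # Step 3: keep a cluster only if it is not too close to kept ones
        cen = centroid(c, vectors)
        if all(cosine(cen, centroid(k, vectors)) < theta1 for k in committees):
            committees.append(c)
    if not committees:  # Step 4: no committee found, stop
        return []
    residues = [e for e in elements  # Step 5: elements covered by no committee
                if all(cosine(vectors[e], centroid(k, vectors)) < theta2
                       for k in committees)]
    if not residues or len(residues) == len(elements):  # Step 6
        return committees
    return committees + find_committees(residues, top_similar, vectors, theta1, theta2)

def assign(elements, committees, vectors):
    """Phase III: assign every element to the cluster of its most similar committee."""
    cents = [centroid(c, vectors) for c in committees]
    clusters = {i: [] for i in range(len(committees))}
    for e in elements:
        i = max(range(len(committees)), key=lambda k: cosine(vectors[e], cents[k]))
        clusters[i].append(e)  # the committee itself never changes
    return clusters
```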
<Section position="6" start_page="22" end_page="22" type="metho"> <SectionTitle> 5 Evaluation Methodology </SectionTitle>
<Paragraph position="0"> Many cluster evaluation schemes have been proposed. They generally fall under two categories:
* comparing cluster outputs with manually generated answer keys (hereon referred to as classes); or
* embedding the clusters in an application and using its evaluation measure.
An example of the first approach considers the average entropy of the clusters, which measures the purity of the clusters (Steinbach, Karypis, and Kumar 2000). However, maximum purity is trivially achieved when each element forms its own cluster. An example of the second approach evaluates the clusters by using them to smooth probability distributions (Lee and Pereira 1999). Like the entropy scheme, we assume that there is an answer key that defines how the elements are supposed to be clustered. Let C be a set of clusters and A be the answer key. We define the editing distance, dist(C, A), as the number of operations required to make C consistent with A. We say that C is consistent with A if there is a one-to-one mapping between clusters in C and the classes in A such that for each cluster c in C, all elements of c belong to the same class in A. We allow two editing operations:
* merge two clusters; and
* move an element from one cluster to another.
Let B be the baseline clustering where each element is its own cluster. We define the quality of a set of clusters C as follows:
$$quality(C) = 1 - \frac{dist(C, A)}{dist(B, A)}$$
Suppose the goal is to construct a clustering consistent with the answer key. This measure can be interpreted as the percentage of operations saved by starting from C versus starting from the baseline.</Paragraph>
<Paragraph position="1"> We aim to construct a clustering consistent with A as opposed to a clustering identical to A because some senses in A may not exist in the corpus used to generate C. In our experiments, we extract answer classes from WordNet. For example, the word dog belongs to both the Person and Animal classes. However, in the newspaper corpus, the Person sense of dog is at best extremely rare. There is no reason to expect a clustering algorithm to discover this sense of dog. The baseline distance dist(B, A) is exactly the number of elements to be clustered.</Paragraph>
<Paragraph position="2"> We made the assumption that each element belongs to exactly one cluster. The transformation procedure is as follows:
1. Suppose there are m classes in the answer key. We start with a list of m empty sets, each of which is labeled with a class in the answer key.
2. For each cluster, merge it with the set whose class has the largest number of elements in the cluster (a tie is broken arbitrarily).
3. If an element is in a set whose class is not the same as one of the element's classes, move the element to a set where it belongs.
dist(C, A) is the number of operations performed using the above transformation rules on C. In the worked example accompanying this procedure, the cluster containing e could have been merged with either set (we arbitrarily chose the second), and the total number of operations is 4.</Paragraph> </Section>
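Under one reading of this procedure (every Step 2 merge counts as an operation, which makes dist(B, A) equal to the number of elements, as stated above), the measure can be sketched as follows. The data representation, clusters as lists of elements and the answer key as a map from each element to its set of classes, is an assumption of the sketch.

```python
# Hypothetical sketch of the editing-distance evaluation described above.
from collections import Counter

def dist(clusters, answer):
    """Operations needed to make `clusters` consistent with the answer key."""
    classes = sorted({cls for clss in answer.values() for cls in clss})
    sets = {cls: [] for cls in classes}   # Step 1: one labeled empty set per class
    ops = 0
    for cluster in clusters:              # Step 2: merge each cluster with the best set
        votes = Counter(cls for e in cluster for cls in answer[e])
        target = max(classes, key=lambda cls: votes[cls])
        sets[target].extend(cluster)
        ops += 1
    for cls in classes:                   # Step 3: move misplaced elements
        for e in list(sets[cls]):
            if cls not in answer[e]:
                sets[cls].remove(e)
                sets[next(c for c in classes if c in answer[e])].append(e)
                ops += 1
    return ops

def quality(clusters, answer):
    """quality(C) = 1 - dist(C, A) / dist(B, A), with dist(B, A) = number of elements."""
    baseline = sum(len(c) for c in clusters)
    return 1.0 - dist(clusters, answer) / baseline
```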
<Section position="7" start_page="22" end_page="13403" type="metho"> <SectionTitle> 6 Experimental Results </SectionTitle>
<Paragraph position="0"> We generated clusters from a news corpus using CBC and compared them with classes extracted from WordNet (Miller 1990).</Paragraph>
<Section position="1" start_page="22" end_page="13403" type="sub_section"> <SectionTitle> 6.1. Test Data </SectionTitle>
<Paragraph position="0"> To extract classes from WordNet, we first estimate the probability of a random word belonging to a subhierarchy (a synset and its hyponyms). We use the frequency counts of synsets in the SemCor corpus (Landes, Leacock, and Tengi 1998) to estimate the probability of a subhierarchy. Since SemCor is a fairly small corpus, the frequency counts of the synsets in the lower part of the WordNet hierarchy are very sparse. We smooth the probabilities by assuming that all siblings are equally likely given the parent. A class is then defined as the maximal subhierarchy with probability less than a threshold.</Paragraph>
<Paragraph position="1"> We used Minipar (Lin 1994), a broad-coverage English parser, to parse about 1GB (144M words) of newspaper text from the TREC collection (1988 AP Newswire, 1989-90 LA Times, and 1991 San Jose Mercury) at a speed of about 500 words/second on a PIII-750 with 512MB memory. We collected the frequency counts of the grammatical relationships (contexts) output by Minipar and used them to compute the pointwise mutual information values from Section 3. The test set is constructed by intersecting the words in WordNet with the nouns in the corpus whose total mutual information with all of their contexts exceeds a threshold m. Since WordNet has a low coverage of proper names, we removed all capitalized nouns. We constructed two test sets: S13403, consisting of 13403 words (m = 250), and S3566, consisting of 3566 words (m = 3500). We then removed from the answer classes the words that did not occur in the test sets. Table 1 summarizes the test sets. The sizes of the WordNet classes vary a lot. For S13403 there are 99 classes that contain three words or fewer, and the largest class contains 3246 words.</Paragraph>
<Paragraph position="2"> We clustered the test sets using CBC and the clustering algorithms of Section 2 and applied the evaluation methodology from the previous section. Table 2 shows the results. The columns are our editing distance based evaluation measure. Test set S3566 has a higher score for all algorithms because it has a higher number of average features per word than S13403. For the K-means and Buckshot algorithms, we set the number of clusters to 250 and the maximum number of iterations to 8. We used a sample size of 2000 for Buckshot. For the Bisecting K-means algorithm, we applied the basic K-means algorithm twice (α = 2 in Section 2) with a maximum of 8 iterations per split. Our implementation of Chameleon was unable to complete clustering S13403 in reasonable time due to its time complexity. Table 2 shows that K-means, Buckshot and Average-link have very similar performance. CBC outperforms all other algorithms on both data sets.</Paragraph> </Section>
<Section position="2" start_page="13403" end_page="13403" type="sub_section"> <SectionTitle> 6.3. Manual Inspection </SectionTitle>
<Paragraph position="0"> Let c be a cluster and wn(c) be the WordNet class that has the largest intersection with c. The precision of c is defined as:
$$precision(c) = \frac{|c \cap wn(c)|}{|c|}$$
</Paragraph>
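A minimal helper computing this precision, assuming clusters and WordNet classes are represented as sets of words (the representation and names are ours, not the paper's):

```python
# Illustrative helper: precision of a cluster against the WordNet class
# that intersects it most.
def precision(cluster, wordnet_classes):
    """cluster: set of words; wordnet_classes: iterable of sets of words."""
    wn_c = max(wordnet_classes, key=lambda cls: len(cluster & cls))
    return len(cluster & wn_c) / len(cluster)
```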
<Paragraph position="1"> CBC discovered 943 clusters. We sorted them according to their precision. Table 3 shows five of the clusters, evenly distributed according to their precision ranking, along with their top-15 features with the highest mutual information. The words in the clusters are listed in descending order of their similarity to the cluster centroid. For each cluster c, we also include wn(c). The underlined words are in wn(c). The first cluster is clearly a cluster of firearms and the second one is of pests. In WordNet, the word pest is curiously only under the person hierarchy. The words stopwatch and houseplant do not belong to the clusters, but they have low similarity to their cluster centroids. The third cluster represents some kind of control. In WordNet, the legal power sense of jurisdiction is not a hyponym of social control as are supervision, oversight and governance. The fourth cluster is about mixtures. The words blend and mix as the event of mixing are present in WordNet but not as the result of mixing. The last cluster is about consumers. Here is the consumer class in WordNet 1.5: addict, alcoholic, big spender, buyer, client, concert-goer, consumer, customer, cutter, diner, drinker, drug addict, drug user, drunk, eater, feeder, fungi, head, heroin addict, home buyer, junkie, junky, lush, nonsmoker, patron, policyholder, purchaser, reader, regular, shopper, smoker, spender, subscriber, sucker, taker, user, vegetarian, wearer. In our cluster, only the word client belongs to WordNet's consumer class. The cluster is ranked very low because WordNet failed to consider words like patient, tenant and renter as consumers.</Paragraph>
<Paragraph position="2"> Table 3 shows that even the lowest ranking CBC clusters are fairly coherent. The features associated with each cluster can be used to classify previously unseen words into one or more existing clusters.</Paragraph>
<Paragraph position="3"> Table 4 shows the clusters containing the word cell that are discovered by various clustering algorithms from S13403. The underlined words represent the words that belong to the cell class in WordNet. The CBC cluster corresponds almost exactly to WordNet's cell class. K-means and Buckshot produced fairly coherent clusters. The cluster constructed by Bisecting K-means is obviously of inferior quality. This is consistent with the fact that Bisecting K-means has a much lower score on S13403.</Paragraph> </Section> </Section>
<Section position="8" start_page="13403" end_page="13403" type="metho"> <SectionTitle> Table 3. Five of the clusters discovered by CBC, along with their features with the top-15 highest mutual information and the WordNet classes that have the largest intersection with each cluster </SectionTitle>
<Paragraph position="0">
Rank 1
  Members: handgun, revolver, shotgun, pistol, rifle, machine gun, sawed-off shotgun, submachine gun, gun, automatic pistol, automatic rifle, firearm, carbine, ammunition, magnum, cartridge, automatic, stopwatch
  Top-15 features: __ blast, barrel of __, brandish __, fire __, point __, pull out __, __ discharge, __ fire, __ go off, arm with __, fire with __, kill with __, open fire with __, shoot with __, threaten with __
  wn(c): artifact / artifact
Rank 236
  Members: whitefly, pest, aphid, fruit fly, termite, mosquito, cockroach, flea, beetle, killer bee, maggot, predator, mite, houseplant, cricket
  Top-15 features: __ control, __ infestation, __ larvae, __ population, infestation of __, specie of __, swarm of __, attract __, breed __, eat __, eradicate __, feed on __, get rid of __, repel __, ward off __
  wn(c): animal / animate being / beast / brute / creature / fauna
Rank 471
  Members: supervision, discipline, oversight, control, governance, decision making, jurisdiction
  Top-15 features: breakdown in __, lack of __, loss of __, assume __, exercise __, exert __, maintain __, retain __, seize __, tighten __, bring under __, operate under __, place under __, put under __, remain under __
  wn(c): act / human action / human activity
Rank 706
  Members: blend, mix, mixture, combination, juxtaposition, combine, amalgam, sprinkle, synthesis, hybrid, melange
  Top-15 features: dip in __, marinate in __, pour in __, stir in __, use in __, add to __, pour __, stir __, curious __, eclectic __, ethnic __, odd __, potent __, unique __, unusual __
  wn(c): group / grouping
Rank 941
  Members: employee, client, patient, applicant, tenant, individual, participant, renter, volunteer, recipient, caller, internee, enrollee, giver
  Top-15 features: benefit for __, care for __, housing for __, benefit to __, service to __, filed by __, paid by __, use by __, provide for __, require for __, give to __, offer to __, provide to __, disgruntled __, indigent __
  wn(c): worker
</Paragraph> </Section> </Paper>