Word Sense Induction: Triplet-Based Clustering and Automatic Evaluation

3 Triplet-based algorithm

The algorithm proposed in this work is based on the one sense per collocation observation (Gale et al., 1992). That essentially means that whenever a pair of words co-occurs significantly often in a corpus (hence a collocation), the concept referenced by that pair is unambiguous, e.g. growing plant vs. power plant. However, as also pointed out by Yarowsky (1995), this observation does not hold uniformly over all possible co-occurrences of two words. It is stronger for adjacent co-occurrences or for word pairs in a predicate-argument relationship than for arbitrary associations at equivalent distance; a plant, for example, is much less clear-cut. To alleviate this problem, the first step of the presented algorithm is to build triplets of words (the target word and two of its co-occurrences) instead of pairs (the target word and one co-occurrence). This means that a plant is further restricted by another word: even a stop word such as on rules out several possible interpretations of a plant, or at least makes them much less probable.

The algorithm was applied to two types of co-occurrence data. In order to show the influence of window size, both the most significant sentence-wide co-occurrences and the direct neighbour co-occurrences were computed for each word. The significance values were obtained using the log-likelihood measure, assuming a binomial distribution for the unrelatedness hypothesis (Dunning, 1993). For each word, only the 200 most significant co-occurrences were kept. This threshold, like all others to follow, was chosen after experimenting with the algorithm. However, as will be shown in section 4, the exact choice of these numbers does not matter much: the presented evaluation method makes it possible to find the optimal configuration of parameters automatically using a genetic algorithm.

The core assumption of the triplet-based algorithm is that any three (or more) words either uniquely identify a topic, concept or sense, or identify no meaningful topic at all. Using the previously acquired most significant co-occurrences (of both types), the lists of co-occurrences for all three words of a triplet are intersected to retain the words contained in all three lists. If the three words cover a topic, e.g. space, NASA, Mars, then the intersection will not be empty, e.g. launch, probe, cosmonaut, .... If the three words do not identify a meaningful topic, e.g. space, NASA, cupboard, then the intersection will most likely contain few to no words at all. Intersections of triplets built from function words, however, are very likely to contain many co-occurrences even if they do not identify a unique topic. These so-called 'stop words' are thus removed both from the co-occurrences from which triplets are built and from the co-occurrences which are used as features.

It is then straightforward to create all possible triplets of the co-occurrences of the target word w and to compute the intersection of their co-occurrence lists. Using these intersections as features of the triplets, triplets of words that have similar features can be grouped together by means of any standard clustering algorithm; both the significance computation and the triplet construction are sketched below.
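The following is a minimal sketch of the significance step described above, using the standard log-likelihood ratio for collocations (Dunning, 1993). It is not the paper's implementation; the function names and the count-table layout (freq, pair_freq) are assumptions made for the example.

```python
import math

def log_likelihood(f_a, f_b, f_ab, n):
    """Log-likelihood ratio (Dunning, 1993) for the co-occurrence of
    words a and b: compares the binomial likelihood of the observed
    counts under independence against the unconstrained alternative.
    f_a, f_b -- corpus frequencies of words a and b
    f_ab     -- number of contexts in which a and b co-occur
    n        -- total number of contexts (e.g. sentences)
    """
    def ll(k, m, p):
        # log-likelihood of k successes in m trials under Binomial(m, p)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        return k * math.log(p) + (m - k) * math.log(1.0 - p)

    p = f_b / n                      # P(b) under the unrelatedness hypothesis
    p1 = f_ab / f_a                  # P(b | a)
    p2 = (f_b - f_ab) / (n - f_a)    # P(b | not a)
    return 2.0 * (ll(f_ab, f_a, p1) + ll(f_b - f_ab, n - f_a, p2)
                  - ll(f_ab, f_a, p) - ll(f_b - f_ab, n - f_a, p))

def top_cooccurrences(word, freq, pair_freq, n, k=200):
    """Keep only the k most significant co-occurrences of `word`,
    mirroring the paper's cut-off of 200."""
    scored = sorted(((other, log_likelihood(freq[word], freq[other], f_ab, n))
                     for other, f_ab in pair_freq[word].items()),
                    key=lambda t: t[1], reverse=True)
    return [other for other, _ in scored[:k]]
```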
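Likewise, the triplet construction and intersection step might look as follows. The data layout (a mapping from each word to its most significant co-occurrences) and all names are again illustrative, and for simplicity the sketch already restricts itself to triplets containing the target word, a restriction the next paragraph motivates; the minimum intersection size of 4 anticipates the noise threshold mentioned further below.

```python
from itertools import combinations

def build_triplets(target, cooc, stopwords, min_intersection=4):
    """Build all triplets (target, a, b) from the target word's most
    significant co-occurrences and intersect the three words'
    co-occurrence lists to obtain each triplet's features.
    cooc maps every word to its most significant co-occurrences;
    stop words are excluded both as triplet members and as features."""
    candidates = [w for w in cooc[target] if w not in stopwords]
    target_set = set(cooc[target])
    triplets = {}
    for a, b in combinations(candidates, 2):
        # feature set: words co-occurring with all three triplet words
        features = (target_set & set(cooc[a]) & set(cooc[b])) - set(stopwords)
        if len(features) >= min_intersection:
            triplets[(target, a, b)] = features
    return triplets
```

For a topical triplet such as (space, NASA, Mars) the resulting feature set would contain words like launch and probe; for (space, NASA, cupboard) it would be empty or nearly so, and the triplet would be discarded.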
However, in order to 'tie' the referenced meanings of the triplets to the target word w, the resulting set of triplets can be restricted to those that also contain the target word. This has the useful side-effect of reducing the number of triplets to cluster. To further reduce the remaining number of $\binom{200}{2} = 19900$ items to be clustered, an iterative, incremental windowing mechanism has been added. Instead of clustering all triplets in one step, 30 co-occurrences, beginning with the most significant ones, are taken in each step to build $\binom{30}{2} = 435$ triplets and their intersections. The resulting elements (triplets with the intersections of their respective co-occurrences as features) are then clustered together with the clusters remaining from the previous step.

In each step of the clustering algorithm, the words from the triplets and the features are merged if the overlap factor similarity measure (Curran, 2003) finds them to be similar enough (over 80% overlapping words out of 200). Thus, if the elements (space, NASA, Mars) : (orbital, satellite, astronauts, ...) and (space, launch, Mars) : (orbit, satellite, astronaut, ...) were found to be similar, they are merged to (space=2, NASA=1, launch=1, Mars=2) : (orbital=1, satellite=2, astronauts=1, orbit=1, astronaut=1, ...). Since this similarity measure utilizes only the features for comparisons, the result can contain two or more clusters with almost identical key sets (which result from merging triplets). A post-clustering step is therefore applied in order to compare clusters by the former triplet words and to merge spurious sense distinctions. After the final clusters have thus been established, the words that remain unclustered can be classified to the resulting clusters. Classification is performed by comparing the co-occurrences of each remaining word to the agglomerated feature words of each sense. If the overlap similarity to the most similar sense is below 0.8, the given word is not classified. The entire clustering algorithm can then be summarized as follows (a sketch of the merging and classification steps follows below):

* The target word is w.
* In each step, take the next 30 co-occurrences of w:
  - Build all possible pairs of the 30 co-occurrences and add w to each pair to make them triplets.
  - Compute the intersections of the co-occurrences of each triplet.
  - Cluster the triplets, using their intersections as features, together with the clusters remaining from the previous step. Whenever two clusters are found to belong together, both the words from the triplets and the features are merged, increasing their counts.
* Cluster the results of the loop, using the merged words of the triplets as features.
* Classify unused words to the resulting clusters, if possible.

In order to reduce noise, introduced for example by triplets of unrelated words whose intersections nevertheless contain a few words, there is a minimum intersection size threshold, which was set to 4. Another parameter worth mentioning is that after the last clustering step all clusters containing fewer than 8 words are removed. Keeping track of how many times a given word has 'hit' a certain cluster (in each merging step) makes it possible to add a post-processing step in which a word is removed from a cluster if it has 'hit' another cluster significantly more often.
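A minimal sketch of the incremental merging step and the final classification might look as follows. It assumes the overlap factor is the share of common words relative to the smaller of the two feature sets (the exact formulation in Curran (2003) may differ) and represents clusters as pairs of word and feature Counters; all names are illustrative.

```python
from collections import Counter

def overlap(a, b):
    """Overlap factor similarity: shared items relative to the
    smaller of the two sets (assumed formulation)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def cluster_step(elements, clusters, threshold=0.8):
    """One incremental step: each new element (word counts, feature
    counts) is merged into the most similar existing cluster if the
    feature overlap exceeds the 80% threshold; otherwise it opens a
    new cluster. Merging sums the counts, as in the paper's example
    (space, NASA, Mars) + (space, launch, Mars) -> space=2, Mars=2, ...
    """
    for words, feats in elements:
        best, best_sim = None, 0.0
        for c_words, c_feats in clusters:
            sim = overlap(set(feats), set(c_feats))
            if sim > best_sim:
                best, best_sim = (c_words, c_feats), sim
        if best is not None and best_sim >= threshold:
            best[0].update(words)   # Counter.update adds the counts
            best[1].update(feats)
        else:
            clusters.append((Counter(words), Counter(feats)))
    return clusters

def classify(word, cooc, senses, threshold=0.8):
    """Assign a remaining unclustered word to the sense whose
    agglomerated feature words are most similar to the word's own
    co-occurrences; below the 0.8 threshold it stays unassigned."""
    best, best_sim = None, 0.0
    for sense_words, sense_feats in senses:
        sim = overlap(set(cooc[word]), set(sense_feats))
        if sim > best_sim:
            best, best_sim = (sense_words, sense_feats), sim
    return best if best_sim >= threshold else None
```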
There are several issues and open questions that arise from this approach. Most obviously: why use this particular similarity measure, this particular clustering method, and why merge the vectors instead of creating proper centroids? It is possible that another combination of such design decisions would produce better results. However, the overall observation is that the results are fairly stable with respect to these decisions, whereas parameters such as the frequency of the target word, the size of the corpus, the balance of the various senses, and others have a much greater impact.