<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1194"> <Title>Discovering word senses from a network of lexical cooccurrences</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Networks of lexical cooccurrences </SectionTitle> <Paragraph position="0"> The method we present in this article for discovering word senses was applied to both French and English. Hence, two networks of lexical cooccurrences were built: one for French, from the Le Monde newspaper (24 months between 1990 and 1994), and one for English, from the L.A. Times newspaper (2 years, part of the TREC corpus). The size of each corpus was around 40 million words.</Paragraph> <Paragraph position="1"> The building process was the same for the two networks. First, the initial corpus was pre-processed in order to characterize texts by their topically significant words. Thus, we retained only the lemmatized form of content words, that is, nouns, verbs and adjectives. Cooccurrences were classically extracted by moving a fixed-size window over texts. Parameters were chosen in order to capture topical relations: the window was rather large (20 words wide) and took into account the boundaries of texts; moreover, cooccurrences were indifferent to word order. Following (Church and Hanks, 1990), we adopted an estimate of mutual information as a cohesion measure of each cooccurrence. This measure was normalized according to the maximal mutual information relative to the considered corpus. After filtering out the least significant cooccurrences (those with fewer than 10 occurrences and a cohesion lower than 0.1), we got a network with approximately 23,000 words and 5.2 million cooccurrences for French, and 30,000 words and 4.8 million cooccurrences for English.</Paragraph> </Section> <Section position="5" start_page="0" end_page="1" type="metho"> <SectionTitle> 4 Word sense discovery algorithm </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 4.1 Building of the similarity matrix between cooccurrents </SectionTitle> <Paragraph position="0"> The number and the extent of the clusters built by a clustering algorithm generally depend on a set of parameters that can be tuned in one way or another. But this possibility is implicitly limited by the similarity measure used for comparing the elements to cluster. In our case, the elements to cluster are the cooccurrents, in the network of lexical cooccurrences, of the word whose senses have to be discriminated. Within the same framework, we tested two measures for evaluating the similarity between the cooccurrents of a word in order to get word senses with different levels of granularity. The first measure corresponds to the cohesion measure between words in the cooccurrence network. If there is no relation between two words in the network, the similarity is equal to zero. This measure has the advantage of being simple and efficient from an algorithmic viewpoint, but some semantic relations are difficult to capture from cooccurrences in texts alone. For instance, we experimentally noticed that there are few synonyms of a word among its cooccurrents (an observation drawn from intersecting, for each word of the L.A. Times network, its cooccurrents in the network with its synonyms in WordNet). Hence, we can expect that some senses that are discriminated by the algorithm actually refer to the same sense.</Paragraph>
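As a rough illustration of this building process and of the first similarity measure, here is a minimal Python sketch. It assumes the corpus is already lemmatized and filtered down to content words; the function names and the exact form of the normalized mutual information are ours, not the original implementation.

```python
import math
from collections import Counter

def build_network(docs, window=20, min_count=10, min_cohesion=0.1):
    """Build a cooccurrence network from lemmatized texts.

    docs: list of texts, each a list of lemmatized content words
    (nouns, verbs, adjectives). Windows never cross text boundaries
    and word order inside a window is ignored.
    """
    word_freq, pair_freq = Counter(), Counter()
    for doc in docs:
        word_freq.update(doc)
        for i, w1 in enumerate(doc):
            # unordered pairs inside a 20-word window
            for w2 in doc[i + 1:i + window]:
                if w1 != w2:
                    pair_freq[tuple(sorted((w1, w2)))] += 1
    total = sum(word_freq.values())
    # mutual information of each cooccurrence (Church and Hanks, 1990)
    mi = {p: math.log2(n * total / (word_freq[p[0]] * word_freq[p[1]]))
          for p, n in pair_freq.items() if n >= min_count}
    max_mi = max(mi.values(), default=1.0)
    # normalize by the maximal mutual information of the corpus and
    # filter out the least significant cooccurrences
    return {p: v / max_mi for p, v in mi.items() if v / max_mi >= min_cohesion}

def first_order_similarity(network, w1, w2):
    """First measure: cohesion in the network, 0 if the words are not linked."""
    return network.get(tuple(sorted((w1, w2))), 0.0)
```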
<Paragraph position="1"> To overcome this difficulty, we also tested a measure that relies not only on first order cooccurrences but also on second order cooccurrences, which are known to be &quot;less sparse and more robust&quot; than first order ones (Schutze, 1998). This measure is based on the following principle: a vector whose size is equal to the number of cooccurrents of the considered word is associated with each of its cooccurrents. This vector contains the cohesion values between this cooccurrent and the other ones. As for the first measure, a null value is taken when there is no relation between two words in the cooccurrence network.</Paragraph> <Paragraph position="2"> The similarity matrix is then built by applying the cosine measure between each pair of vectors, i.e. each pair of cooccurrents. With this second measure, two cooccurrents can be found strongly linked even though they are not directly linked in the cooccurrence network: they just have to share a significant number of words with which they are linked in the cooccurrence network.</Paragraph> </Section>
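The principle of the second measure can be sketched in the same setting, reusing the network built above; NumPy and the function name are ours.

```python
import numpy as np

def second_order_matrix(network, cooccurrents):
    """Second measure: cosine between the cohesion vectors of cooccurrents.

    Each cooccurrent of the target word is represented by the vector of
    its cohesion values with the other cooccurrents (0 when two words
    are not linked in the cooccurrence network).
    """
    n = len(cooccurrents)
    vectors = np.zeros((n, n))
    for i, w1 in enumerate(cooccurrents):
        for j, w2 in enumerate(cooccurrents):
            if i != j:
                vectors[i, j] = network.get(tuple(sorted((w1, w2))), 0.0)
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0        # keep a zero row for isolated words
    unit = vectors / norms[:, None]
    sim = unit @ unit.T            # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)
    return sim
```

Two cooccurrents that are never directly linked can still obtain a high cosine value here, as long as their cohesion vectors share enough non-zero coordinates.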
<Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 The Shared Nearest Neighbors (SNN) algorithm </SectionTitle> <Paragraph position="0"> The SNN algorithm is representative of the algorithms that perform clustering by detecting the high-density areas of a similarity graph. In such a graph, each vertex represents an element to cluster and an edge links two vertices whose similarity is not null. In our case, the similarity graph directly corresponds to the cooccurrence network with the first order cooccurrences, whereas with the second order cooccurrences it is built from the similarity matrix described in Section 4.1. The SNN algorithm can be split up into two main steps: the first one aims at finding the elements that are the most representative of their neighborhood by masking the less important relations in the similarity graph. These elements are the seeds of the final clusters, which are built in the second step by aggregating the remaining elements to those selected in the first step. More precisely, the SNN algorithm is applied to the discovery of the senses of a target word as follows (a code sketch of these steps is given after the list): 1. sparsification of the similarity graph: for each cooccurrent of the target word, only the links towards the k (k=15 in our experiments) most similar other cooccurrents are kept.</Paragraph> <Paragraph position="1"> 2. building of the shared nearest neighbor graph: this step only consists in replacing, in the sparsified graph, the value of each edge by the number of direct neighbors shared by the two cooccurrents linked by this edge.</Paragraph> <Paragraph position="2"> 3. computation of the distribution of strong links among cooccurrents: like the first step, this one is a kind of sparsification. Its aim is to help find the seeds of the senses, i.e. the cooccurrents that are the most representative of a set of cooccurrents. This step is also a means for discarding the cooccurrents that have no relation with the other ones. More precisely, two cooccurrents are considered as strongly linked if the number of neighbors they share is higher than a fixed threshold. The number of strong links of each cooccurrent is then computed. 4. identification of the sense seeds and filtering of noise: the sense seeds and the cooccurrents to discard are determined by comparing their number of strong links with a fixed threshold.</Paragraph> <Paragraph position="3"> 5. building of senses: this step mainly consists in associating to the sense seeds identified by the previous step the remaining cooccurrents that are the most similar to them. The result is a set of clusters that each represents a sense of the target word. For associating a cooccurrent to a sense seed, the strength of the link between them must be higher than a given threshold. If a cooccurrent can be tied to several seeds, the one that is the most strongly linked to it is chosen. Moreover, the seeds that are considered as too close to each other to give rise to separate senses can also be grouped during this step, in accordance with the same criteria as the other cooccurrents.</Paragraph> <Paragraph position="4"> 6. extension of senses: after the previous steps, a set of cooccurrents that are not considered as noise is still not associated to a sense. The size of this set depends on the strictness of the threshold controlling the aggregation of a cooccurrent to a sense seed, but as we are interested in getting homogeneous senses, the value of this threshold cannot be too low. Nevertheless, we are also interested in having a definition as complete as possible of each sense. As senses are defined at this point more precisely than at step 4, the integration into these senses of cooccurrents that are not strongly linked to a sense seed can be performed on a larger basis, hence in a more reliable way.</Paragraph> </Section>
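Under the same assumptions, the six steps can be condensed into a single function. The thresholds appear here as plain constants for readability (the paper derives them from quantiles, see Section 4.3) and the seed-merging part of step 5 is omitted; all names are ours.

```python
def snn_senses(sim, k=15, strong_link=3, seed_min=5, noise_max=1, attach_min=2):
    """Shared Nearest Neighbors clustering of the cooccurrents of a word.

    sim: square similarity matrix between cooccurrents (first or second
    order). Returns a dict mapping each sense seed to the cooccurrents
    of its cluster.
    """
    n = len(sim)
    # step 1: sparsification, keep the k most similar neighbors of each node
    neighbors = [set(sorted(range(n), key=lambda j: -sim[i][j])[:k]) - {i}
                 for i in range(n)]
    # step 2: shared nearest neighbor graph
    shared = [[len(neighbors[i] & neighbors[j]) for j in range(n)]
              for i in range(n)]
    # step 3: number of strong links of each cooccurrent
    strong = [sum(1 for j in range(n) if j != i and shared[i][j] > strong_link)
              for i in range(n)]
    # step 4: identify sense seeds and discard noise
    seeds = {i for i in range(n) if strong[i] >= seed_min}
    noise = {i for i in range(n) if strong[i] <= noise_max} - seeds
    # step 5: attach each remaining cooccurrent to its closest seed
    label = {s: s for s in seeds}
    for i in set(range(n)) - seeds - noise:
        best = max(seeds, key=lambda s: shared[i][s], default=None)
        if best is not None and shared[i][best] > attach_min:
            label[i] = best
    # step 6 (extension) would further aggregate the cooccurrents left
    # unlabeled, on the basis of their average link to each sense
    clusters = {}
    for i, seed in label.items():
        clusters.setdefault(seed, []).append(i)
    return clusters
```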
<Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.3 Adaptation of the SNN algorithm </SectionTitle> <Paragraph position="0"> For implementing the SNN algorithm presented in the previous section, one of the points that must be specified more precisely is the way its different thresholds are fixed. In our case, we chose the same method for all of them: each threshold is set as a quantile of the values it is applied to. In this way, it is adapted to the distribution of these values. For the identification of the sense seeds (threshold equal to 0.9) and for the definition of the cooccurrents that are noise (threshold equal to 0.2), the thresholds are quantiles of the number of strong links of cooccurrents. For defining strong links (threshold equal to 0.65), associating cooccurrents to sense seeds (threshold equal to 0.5) and aggregating cooccurrents to senses (threshold equal to 0.7), the thresholds are quantiles of the strength of the links between cooccurrents in the shared nearest neighbor graph.</Paragraph> <Paragraph position="1"> We also introduced two main improvements to the SNN algorithm. The first one is the addition of a new step between the last two ones. This comes from the following observation: although a sense seed can be associated to another one during step 5, which means that the two senses they represent are merged, some clusters that actually correspond to one sense are not merged. This problem is observed with both first and second order cooccurrences and cannot be solved only by adjusting the threshold that controls the association of a cooccurrent to a sense seed without also merging unrelated senses. In most of these cases, the &quot;split&quot; sense is scattered over one large cluster and one or several small clusters that only contain 3 or 4 cooccurrents. More precisely, the sense seeds of the small clusters are not associated to the seed of the large cluster, while most of the cooccurrents that are linked to them are associated to this seed.</Paragraph> <Paragraph position="2"> Instead of defining a specific mechanism for dealing with these small clusters, we chose to let the SNN algorithm solve the problem by simply deleting these small clusters (size < 6) after step 5 and marking their cooccurrents as unclassified.</Paragraph> <Paragraph position="3"> In most cases, the last step of the algorithm aggregates these cooccurrents to the large cluster.</Paragraph> <Paragraph position="4"> Moreover, this new step makes the built senses more stable when the parameters of the algorithm are only slightly modified.</Paragraph> <Paragraph position="5"> The second improvement, which has a smaller impact than the first one, aims at limiting the noise that is brought into clusters by the last step. In the algorithm of (Ertoz et al., 2001), an element can be associated to a cluster when the strength of its link with one of the elements of this cluster is higher than a given threshold. This condition is stricter in our case, as it concerns the average strength of the links between the unclassified cooccurrent and those of the cluster.</Paragraph> </Section> </Section>
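A minimal sketch of this quantile-based setting of the thresholds, assuming NumPy; the arrays stand in for the real distributions and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for the two distributions the thresholds are applied to
strong_link_counts = rng.integers(0, 30, size=200)   # per cooccurrent
link_strengths = rng.integers(0, 15, size=1000)      # per SNN edge

# quantiles of the number of strong links of cooccurrents
seed_threshold = np.quantile(strong_link_counts, 0.9)    # sense seeds
noise_threshold = np.quantile(strong_link_counts, 0.2)   # noise

# quantiles of the link strengths in the shared nearest neighbor graph
strong_threshold = np.quantile(link_strengths, 0.65)     # strong links
attach_threshold = np.quantile(link_strengths, 0.5)      # seed association
extend_threshold = np.quantile(link_strengths, 0.7)      # sense extension
```

Because each threshold follows the distribution it is applied to, the same quantile levels can be kept from one target word to another.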
<Section position="6" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> We applied our algorithm for discovering word senses to the two networks of lexical cooccurrences described in Section 3 (LM: French; LAT: English), with the parameters given in Section 4. For each network, we tested the use of first order cooccurrences (LM-1 and LAT-1) and of second order ones (LM-2 and LAT-2). For English, the use of second order cooccurrences was tested only for the subset of the words of the network that was selected for the evaluation of Section 6 (LAT-2.no). Table 1 gives some statistics about the discovered senses for the different cases.

Table 1: statistics about the discovered senses
                                             LM-1    LM-2    LAT-1   LAT-1.no  LAT-2.no
number of words                              17,261  17,261  13,414  6,177     6,177
percentage of words with at least one sense  44.4%   42.7%   39.8%   41.8%     39%
average number of senses by word             2.8     2.2     1.6     1.9       1.5
average number of words describing a sense   16.1    16.3    18.7    20.2      18.9

We can notice that a significant percentage of words do not have any sense, even with second order cooccurrences. This comes from the fact that their cooccurrents are weakly linked to each other in the cooccurrence network they are part of, which probably means that their senses are not actually represented in this network. We can also notice that the use of second order cooccurrences actually leads to a smaller number of senses per word, hence to senses with a broader definition. Following Veronis (2003), we give in Table 2, as an example of the results of our algorithm, some of the words defining the senses of the polysemous French word barrage, which was part of the ROMANSEVAL evaluation. Whatever the kind of cooccurrences it relies on, our algorithm finds three of the four senses distinguished in (Veronis, 2003): dam (senses 1.3 and 2.1); barricading, blocking (senses 1.1, 1.2 and 2.2); barrier, frontier (senses 1.4 and 2.3).</Paragraph> <Paragraph position="1"> The sense play-off game (match de barrage), which refers to the domain of sport, is not found, as it is weakly represented in the cooccurrence network and is linked to words, such as division, that are also ambiguous (division refers both to the sport and the military domains).

Table 2 (excerpt): senses of barrage
LM-1 1.1: manifestant, forces_de_l'ordre, prefecture, agriculteur, protester, incendier, calme, pierre (demonstrator, the police, prefecture, farmer, to protest, to burn, quietness, stone)

It should be noted that barrage has only 1,104 occurrences in our corpus, while it has 7,000 occurrences in the corpus of (Veronis, 2003), built by crawling from the Internet the pages found by a meta search engine queried with this word and its morphological variants. This example is also a good illustration of the difference in granularity between the senses built from first order cooccurrences and those built from second order ones. Sense 1.1, which is close to sense 1.2 as both refer to demonstrations in relation to a category of workers, disappears when the second order cooccurrences are used. Table 3 gives examples of discovered senses from first order cooccurrences only, one for French (LM-1) and two for English (LAT-1). Each word is given with its frequency in the corpus used for building the cooccurrence network.

Table 3: examples of discovered senses from first order cooccurrences
organe (1300):
- patient, transplantation, greffe, malade, therapeutique, medical, medecine, greffer, rein (patient, transplantation, transplant, sick person, therapeutic, medical, medicine, to transplant, kidney)
- procreation, embryon, ethique, humain, relatif, bioethique, corps_humain, gene, cellule (procreation, embryo, ethical, human, relative, bioethics, human body, gene, cell)
- constitutionnel, consultatif, constitution, instituer, executif, legislatif, sieger, disposition (constitutional, consultative, constitution, to institute, executive, legislative, to sit, clause)
- article, hebdomadaire, publication, redaction, quotidien, journal, editorial, redacteur (article, weekly, publication, editorial staff, daily, newspaper, editorial, sub-editor)
mouse (563):
- compatible, software, computer, machine, user, desktop, pc, graphics, keyboard, device
- laboratory, researcher, cell, gene, generic, human, hormone, research, scientist, rat
party (16999):
- candidate, democrat, republican, gubernatorial, presidential, partisan, reapportionment
- ballroom, cocktail, champagne, guest, bash, gala, wedding, birthday, invitation, festivity
- caterer, uninvited, party-goers, black-tie, hostess, buffet, glitches, napkins, catering
</Paragraph> </Section> <Section position="7" start_page="1" end_page="3" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> Like most work dedicated to the building of linguistic resources, the discovery of word senses comes up against the problem of evaluating its results. The most direct way of doing so is to compare the resource to evaluate with a similar resource that is acknowledged as a gold standard. For word senses, the WordNet-like lexico-semantic networks can be considered as such a standard. Using this kind of network for evaluating the word senses that we find is of course open to criticism, as our aim is to overcome their insufficiencies. Nevertheless, as these networks are carefully controlled, such an evaluation provides at least a first judgment about the reliability of the discovered senses. We chose to take up the evaluation method proposed in (Pantel and Lin, 2002). This method relies on WordNet and shows a rather good agreement between its results and human judgments (88% for Pantel and Lin). As a consequence, our evaluation was done only for English, and more precisely with WordNet 1.7. For each considered word, the evaluation method tries to map each of its discovered senses to one of its synsets in WordNet by applying a specific similarity measure. Hence, only the precision of the word sense discovery algorithm is evaluated, but Pantel and Lin indicate that recall is not very significant in this context: a discovered sense may be correct and yet not present in WordNet and conversely, some senses in WordNet are very close and should be joined for most of the applications using WordNet. They define a recall measure, but only for ranking the results of a set of systems. Hence, it cannot be applied in our case.</Paragraph> <Paragraph position="1"> The similarity measure between a sense and a synset used for computing precision relies on Lin's similarity measure between two synsets:

sim(s1, s2) = 2 * log P(s) / (log P(s1) + log P(s2))    (1)

where s is the most specific synset that subsumes s1 and s2 in the WordNet hierarchy and P(s) represents the probability of the synset s, estimated from a reference corpus, in this case the SemCor corpus. We used the implementation of this measure provided by the Perl module WordNet::Similarity v0.06 (Patwardhan and Pedersen, 2003). The similarity between a sense and a synset is more precisely defined as the average value of the similarity values between the words that characterize the sense, or a subset of them, and the synset. The similarity between a word and a synset is equal to the highest similarity value among those between the synset and the synsets to which the word belongs. Each of these values is given by (1). A sense is mapped to the synset that is the most similar to it, provided that the similarity between them is higher than a fixed threshold (equal to 0.25, as in (Pantel and Lin, 2002)). Finally, the precision for a word is given by the proportion of its senses that match one of its synsets.</Paragraph>
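For illustration, this mapping can be approximated with NLTK's WordNet interface and its SemCor information content file in place of the Perl module. The sketch below targets the WordNet version shipped with NLTK rather than WordNet 1.7, so it is only an approximation of the original procedure, and the function names are ours.

```python
from nltk.corpus import wordnet as wn, wordnet_ic

# probabilities P(s) estimated from the SemCor corpus
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def word_synset_sim(word, synset):
    """Highest Lin similarity (1) between the synsets of `word` and `synset`."""
    scores = [synset.lin_similarity(s, semcor_ic)
              for s in wn.synsets(word, pos=synset.pos())]
    return max(scores, default=0.0)

def precision(target, senses, threshold=0.25):
    """Proportion of discovered senses mapped to a synset of `target`.

    senses: list of discovered senses, each given as the list of words
    selected to represent it. A sense matches when its average
    similarity with the closest synset exceeds the threshold.
    """
    matched = 0
    for words in senses:
        best = max((sum(word_synset_sim(w, syn) for w in words) / len(words)
                    for syn in wn.synsets(target, pos='n')), default=0.0)
        if best > threshold:
            matched += 1
    return matched / len(senses) if senses else 0.0
```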
<Paragraph position="2"> Table 4 gives the results of the evaluation of our algorithm for the words of the English cooccurrence network that are nouns only and for which at least one sense was discovered. As Pantel and Lin, we only take into account for evaluation 4 words of each sense, whatever the number of words that define it. But, because of the way our senses are built, we do not have a specific measure of the similarity between a word and the words that characterize its senses. Hence, we computed two variants of the precision measure. The first one selects the four words of each sense by relying on their number of strong links in the shared nearest neighbor graph. The second one selects the four words that have the highest similarity score with one of the synsets of the target word, which is called &quot;optimal choice&quot; in Table 4. This selection procedure is only used for evaluation: we do not rely on WordNet for building our senses. A clear difference can be noted between the two variants. With the optimal choice of the four words, we get results that are similar to those of Pantel and Lin: their precision is equal to 60.8 with an average number of words defining a sense equal to 14.</Paragraph> <Paragraph position="3"> On the other hand, Table 4 shows that the words selected on the basis of their number of strong links are not strongly linked in WordNet (according to Lin's measure) to their target word.
This does not mean that the selected words are not interesting for describing the senses of the target word, but more probably that the semantic relations they share with the target word are different from hyperonymy. The results of Pantel and Lin can be explained by the fact that their algorithm is based on the clustering of similar words, i.e. words that are likely to be synonyms, and not on the clustering of the cooccurrents of a word, which are not often synonyms of that word. Moreover, their initial corpus is much larger (around 144 million words) than ours and they make use of more elaborate tools, such as a syntactic analyzer.</Paragraph> <Paragraph position="4">
Table 4: precision of the discovered senses for English in relation with WordNet
                          LAT-1.no   LAT-2.no
number of strong links    19.4       20.8
optimal choice            56.2       63.7

As expected, the results obtained with first order cooccurrences (LAT-1.no), which produce a higher number of senses per word, are lower than the results obtained with second order cooccurrences (LAT-2.no). However, without a recall measure, it is difficult to draw a clear conclusion from this observation: some senses of LAT-1.no probably result from the artificial division of an actual word sense, but the fact that the senses of LAT-2.no are more homogeneous also facilitates in this case the mapping with WordNet's synsets.</Paragraph> </Section> <Section position="8" start_page="3" end_page="3" type="metho"> <SectionTitle> 7 Related work </SectionTitle> <Paragraph position="0"> As they rely on the detection of high-density areas in a network of cooccurrences, (Veronis, 2003) and (Dorow and Widdows, 2003) are the methods closest to ours. Nevertheless, two main differences from our work can be noted. The first one concerns the direct use they make of the network of cooccurrences. In our case, we chose a more general approach by working at the level of a similarity graph: when the similarity of two words is given by their relation of cooccurrence, our situation is comparable to that of (Veronis, 2003) and (Dorow and Widdows, 2003); but within the same framework, we can also take into account other kinds of similarity relations, such as the second order cooccurrences.</Paragraph> <Paragraph position="1"> The second main difference is the fact that they discriminate senses in an iterative way. This approach consists in selecting at each step the most obvious sense and then updating the graph of cooccurrences by discarding the words that make up the new sense. The other senses are then easier to discriminate. We preferred to put the emphasis on the ability to gather close or identical senses that are artificially distinguished (see Section 4.3). From a global viewpoint, these two differences lead (Veronis, 2003) and (Dorow and Widdows, 2003) to build finer-grained senses than ours. Nevertheless, as methods for discovering word senses from a corpus tend to find too large a number of close senses, it was more important from our viewpoint to favour the building of stable senses with a clear definition rather than to discriminate very fine senses.</Paragraph> </Section> </Paper>