<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2007"> <Title>Word Sense Disambiguation Using Automatically Translated Sense Examples</Title> <Section position="6" start_page="48" end_page="50" type="evalu"> <SectionTitle> 5 Experiments on Senseval-3 </SectionTitle> <Paragraph position="0"> In this section we describe the experiments carried out on the Senseval-3 lexical-sample dataset. First, we introduce a heuristic method to deal with the fine granularity of WordNet senses.</Paragraph> <Paragraph position="1"> The remaining subsections are devoted to the baseline system, the contribution of the heuristic to the final system, and a comparison with other Senseval-3 systems.</Paragraph> <Paragraph position="2"> The main difference from our hand-tagged evaluation, apart from the ML algorithm, is that we did not remove the bias from the &quot;one sense per discourse&quot; factor, as she did. (Table 3 caption: for each threshold, the number of removed senses/tokens and the ambiguity are shown.)</Paragraph> <Section position="1" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 5.1 Unsupervised methods on fine-grained senses </SectionTitle> <Paragraph position="0"> When applying unsupervised WSD algorithms to fine-grained word senses, senses that rarely occur in text often cause problems, as they are difficult to detect without relying on hand-tagged data. This is why many WSD systems use sense-tagged corpora such as SemCor to discard or penalise low-frequency senses.</Paragraph> <Paragraph position="1"> For our work, we did not want to rely on hand-tagged corpora, so we devised a method to detect low-frequency senses and remove them before applying our translation-based approach. The method is based on the hypothesis that word senses with few close relatives (synonyms, hypernyms, and hyponyms) tend to have low frequency in corpora. 
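The SemCor-based filtering mentioned above, which this work deliberately avoids, amounts to discarding senses that are rarely attested in a tagged corpus. A minimal sketch, with an invented sense inventory and invented counts (not data from the paper):

```python
# Frequency-based sense filtering, as used by many WSD systems
# (and deliberately avoided in this work): senses whose count in a
# sense-tagged corpus such as SemCor falls under a cutoff are
# discarded. All counts below are invented for illustration.

semcor_counts = {"church%1": 58, "church%2": 21, "church%3": 2}

def frequency_filter(counts, min_count):
    """Keep only senses attested at least min_count times."""
    return {s for s, c in counts.items() if c >= min_count}

print(sorted(frequency_filter(semcor_counts, 5)))
```

The drawback, from the paper's perspective, is that this filter needs hand-tagged data for every target word; the relative-based heuristic below needs only the WordNet graph.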
We collected all the close relatives of the target senses according to WordNet, and then removed every sense whose number of relatives fell below a given threshold. We applied this method to nouns only, as the WordNet hierarchy is most developed for them.</Paragraph> <Paragraph position="2"> First, we observed the effect of sense removal on the SemCor corpus. For all polysemous nouns, we applied different thresholds (4-10 relatives) and measured the percentage of senses and SemCor tokens that were removed. Our goal was to remove as many senses as possible, while keeping as many tokens as possible.</Paragraph> <Paragraph position="4"> Table 3 shows the results of the process on all polysemous nouns in SemCor, a total of 18,912 senses and 70,238 tokens; the initial average number of senses per token is also given there.</Paragraph> <Paragraph position="6"> For the lowest threshold (4), we are able to remove a large number of senses from consideration (40%) while keeping 85% of the tokens in SemCor. Higher thresholds remove more senses, but force us to discard more valid tokens. In Table 3, the best ratios are obtained with the lower thresholds, suggesting that conservative approaches would be better. However, we have to take into account that state-of-the-art unsupervised WSD methods on fine-grained senses perform below 50% recall on this dataset, and therefore a more aggressive approach may be worth trying.</Paragraph> <Paragraph position="7"> We applied this heuristic in our experiments and measured the effect of the threshold parameter by relying on SemCor and the Senseval-3 training data. Thus, we tested the MT-based system with different threshold values, removing senses from consideration when their number of relatives was below the threshold. 
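The relative-based pruning can be sketched as follows; the toy sense inventory stands in for WordNet lookups, and all names and counts are invented, not the authors' code or data:

```python
# Toy stand-in for WordNet: each sense of "bank" lists its close
# relatives (synonyms, hypernyms, hyponyms). Invented for illustration;
# a real system would query the WordNet hierarchy instead.
TOY_SENSES = {
    "bank%1": {"synonyms": ["depository", "banking_company"],
               "hypernyms": ["financial_institution"],
               "hyponyms": ["commercial_bank", "credit_union", "thrift"]},
    "bank%2": {"synonyms": [],
               "hypernyms": ["slope"],
               "hyponyms": ["riverbank"]},
}

def count_relatives(info):
    """Total number of close relatives of one sense."""
    return sum(len(info[rel]) for rel in ("synonyms", "hypernyms", "hyponyms"))

def prune_senses(senses, threshold):
    """Keep only senses whose relative count reaches the threshold."""
    return [s for s, info in senses.items() if count_relatives(info) >= threshold]

# With threshold 4, only the richly connected financial sense survives.
print(prune_senses(TOY_SENSES, 4))
```

On real data the relative lists come from WordNet, and the threshold is then swept over a range (4-10 above, 4-14 in Section 5.3) and fixed using tagged corpora.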
The results of the experiments using this technique are described in Section 5.3.</Paragraph> </Section> <Section position="2" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 5.2 Baseline system </SectionTitle> <Paragraph position="0"> We performed experiments on the Senseval-3 test data with both the MT-based and the dictionary-based approaches. Table 4 shows the results for nouns and adjectives, together with the MFS baseline (obtained from the Senseval-3 lexical-sample training data). The results are similar for nouns, while for adjectives the MT-based system achieves significantly better recall.</Paragraph> <Paragraph position="1"> Overall, the performance was much lower than in our previous 2-way disambiguation, and the system also ranks below the MFS baseline.</Paragraph> <Paragraph position="2"> One of the main reasons for the low performance is that senses with few examples in the test data are over-represented in training: we trained the classifiers on an equal number of examples (at most 200) for every sense, no matter how rarely a sense actually occurs in real text. As explained in the previous section, this problem can be alleviated for nouns with the relative-based heuristic. For the rest of the experiments we used only the MT-based approach, as it performed better than the dictionary-based one.</Paragraph> </Section> <Section position="3" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 5.3 Relative threshold </SectionTitle> <Paragraph position="0"> In this section we explore the contribution of the relative-based threshold to the system, testing on nouns only. In order to tune the threshold parameter, we first applied the method to SemCor and the Senseval-3 training data: we used hand-tagged corpora from two different sources to see whether the method was generic enough to be applied to unseen test data.</Paragraph> <Paragraph position="1"> (Table 4 caption: results for MT-based and dictionary-based methods on the Senseval-3 lexical-sample data; the MFS baseline (%) and the number of test examples are also shown. Table 5 caption: results for different values of the relative-based threshold on the Senseval-3 training data and SemCor, nouns only; best results shown in bold.)</Paragraph> <Paragraph position="2"> Note also that we used this experiment to define a single general threshold for the heuristic, instead of optimising it for different words: once the threshold is fixed, it is used for all target words. The results of the MT-based system for threshold values from 4 to 14 are given in Table 5. The algorithm clearly benefits from the heuristic, especially when ambiguity is reduced to around 2 senses on average. Observe also that the contribution of the threshold is quite similar for SemCor and the Senseval-3 training data. From this table, we chose 11 as the threshold value for the test data, as it obtained the best performance on SemCor.</Paragraph> <Paragraph position="3"> Thus, we performed a single run of the algorithm on the test data with the chosen threshold. The performance for all nouns is given in Table 6. Recall has increased significantly, and is now closer to the MFS baseline, which is a very hard baseline for unsupervised systems (McCarthy et al., 2004). Still, the performance is significantly lower than the scores achieved by supervised systems, which reach above 72% recall (Mihalcea et al., 2004). (Table 6 caption: final results (%) for all nouns in the Senseval-3 test data, together with the number of test examples and the MFS baseline (%).) Some of the reasons for the gap are the following: - The acquisition process: problems can arise 
from ambiguous Chinese words, and the acquired examples can contain noise generated by the MT software.</Paragraph> <Paragraph position="4"> - Distribution of fine-grained senses: as we have seen, rare senses are difficult to detect for unsupervised methods, while supervised systems can simply rely on the observed sense frequencies.</Paragraph> <Paragraph position="5"> - Lack of local context: our system does not benefit from local bigrams and trigrams, which are among the best knowledge sources for supervised systems.</Paragraph> </Section> <Section position="4" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 5.4 Comparison with Senseval-3 unsupervised systems </SectionTitle> <Paragraph position="0"> Finally, we compared the performance of our system with the other unsupervised systems in the Senseval-3 lexical-sample competition. We evaluated these systems on nouns, using the outputs provided by the organisation, and focusing on the systems that are considered unsupervised. However, we noticed that most of these systems used SemCor frequency information, or even Senseval-3 examples, in their models. Thus, we classified the systems depending on whether they used SemCor frequencies (Sc), Senseval-3 examples (S-3), or neither (Unsup.). (Table 7 caption: systems sorted by recall (%); our system given in bold.) This is an important distinction, as simply knowing the most frequent sense in hand-tagged data is a big advantage for unsupervised systems (applying the MFS heuristic for nouns in Senseval-3 would achieve 54.2% precision and 53.0% recall when using SemCor). At this point, we would like to remark that, unlike other systems using SemCor, we have applied it to the minimum extent: its only contribution has been to indirectly set the threshold for our general heuristic based on WordNet relatives. 
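The gap between the precision and recall figures quoted above arises because Senseval-style scoring measures precision over attempted instances but recall over all instances. A minimal sketch of such a scorer, with made-up instance ids and sense labels:

```python
def score(answers, gold):
    """Senseval-style scoring sketch: precision is computed over the
    instances the system answered, recall over all gold instances."""
    attempted = [i for i in answers if i in gold]
    correct = sum(1 for i in attempted if answers[i] == gold[i])
    precision = correct / len(attempted)
    recall = correct / len(gold)
    return precision, recall

# Invented toy data: four gold instances, one left unanswered.
gold = {"n1": "s1", "n2": "s1", "n3": "s2", "n4": "s1"}
answers = {"n1": "s1", "n2": "s1", "n3": "s1"}

p, r = score(answers, gold)
print(round(p, 2), r)
```

A system that abstains on some instances (for example, words whose senses were all pruned) can therefore show precision above recall, as in the MFS figures above.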
We are exploring better ways to integrate the relative information into the model.</Paragraph> <Paragraph position="1"> The results of the Senseval-3 systems are given in Table 7. Only 2 systems require no hand-tagged data at all, and our method outperforms both when using the relative threshold. The best systems in Senseval-3 benefited from the Senseval-3 training examples, particularly the top-scoring system, which is clearly supervised. The 2nd-ranked system requires 10% of the Senseval-3 training examples to map the clusters that it discovers automatically, and the 3rd simply applies the MFS heuristic.</Paragraph> <Paragraph position="2"> The remaining systems introduce the bias of the SemCor sense distribution into their models, which clearly helped their performance for each word. Our system obtains performance similar to the best of those systems without relying on hand-tagged data. We also evaluated the systems on the coarse-grained sense groups provided by the Senseval-3 organisers. The results in Table 8 show that our system is comparatively better on this coarse-grained disambiguation task.</Paragraph> </Section> </Section> </Paper>