<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1112"> <Title>Syntactic features for high precision Word Sense Disambiguation</Title> <Section position="6" start_page="0" end_page="1" type="evalu"> <SectionTitle> 6. Experimental setting and results. </SectionTitle> <Paragraph position="0"> We used the Senseval-2 data (73 nouns, verbs and adjectives), keeping the original training and testing sets. In order to measure the contribution of syntactic features, the following experiments were devised (not all ML algorithms were used in all experiments, as specified): contribution of IGR-type and GR-type relations (Dlist), contribution of syntactic features over a combination of local and topical features (Dlist, Boost), and contribution of syntactic features in a high precision system (Dlist, Boost).</Paragraph> <Paragraph position="1"> Performance is measured as precision and coverage (following the Senseval definitions). We also report F1 to compare overall performance, as it gives the harmonic average between precision and recall (where recall is, in this case, precision times coverage). F1 can also be used to select the best precision/coverage combination (cf. section 6.3). 6.1. Results for different sets of syntactic features (Dlist).</Paragraph> <Paragraph position="2"> Table 1 shows the precision, coverage and F1 figures for each of the grammatical feature sets as used by the decision list algorithm.</Paragraph> <Paragraph position="3"> Instantiated Grammatical Relations provide very good precision but low coverage. The only exceptions are verbs, which get very similar precision for both kinds of syntactic relations. Grammatical Relations provide lower precision but higher coverage. A combination of both attains the best F1, and is the feature set used in the subsequent experiments.</Paragraph> <Paragraph position="4"> 6.2. Contribution of syntactic features over local and topical features (Dlist, Boost). The two ML algorithms were trained on syntactic features, local features, a combination of local+topical features (also called basic), and a combination of all features (basic+syntax) in turn. 
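As a sketch of the evaluation measures defined above, recall is precision times coverage, and F1 is the harmonic average of precision and recall. The operating points below are the Dlist figures reported in the conclusions plus one hypothetical full-coverage point added for contrast:

```python
def f1_from_precision_coverage(precision: float, coverage: float) -> float:
    """F1 as used in the paper: the system abstains on (1 - coverage)
    of the examples, so recall = precision * coverage, and F1 is the
    harmonic average of precision and recall."""
    recall = precision * coverage
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two Dlist operating points from the paper, plus a hypothetical
# 70%-precision full-coverage point for comparison:
points = [(0.95, 0.08), (0.86, 0.26), (0.70, 1.00)]
best = max(points, key=lambda pc: f1_from_precision_coverage(*pc))
```

Under F1, the high-coverage point wins even at lower precision, which is why F1 alone does not characterize a high-precision system and the precision/coverage trade-off of section 6.3 is examined separately.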
Table 2 shows the F1 figures for each algorithm, feature set and PoS.</Paragraph> <Paragraph position="5"> All in all, Boost outperforms Dlist in all cases except for local features. Syntactic features alone get worse results than local features. Regarding the contribution of syntactic features to the basic set, the last two columns in Table 2 show a &quot;+&quot; whenever the difference in precision over the basic feature set is significant (McNemar's test). Dlist scarcely profits from the additional syntactic features (the improvement is significant only for verbs). Boost attains a significant improvement, showing that basic and syntactic features are complementary. The difference between the two ML algorithms could be explained by the fact that Dlist is a conservative algorithm, in the sense that it only uses the positive information given by the first feature that holds in the test example (abstaining if none of them are applicable). By using a combination of the predictions of several single-feature classifiers (using both positive and negative evidence), Boost is able to assign positive predictions to more test examples than Dlist. Since the feature space is more widely covered, and given that the classifiers are quite accurate, Boost achieves better recall levels and is a significantly better algorithm for approaching a 100% coverage WSD system. 6.3. Precision vs. coverage: high precision systems (Dlist, Boost). Figure 1 shows the results for the three methods to exploit the precision/coverage trade-off in order to obtain a high-precision system. For each method, two sets of features have been used: the basic set alone, and the combination of basic and syntactic features.</Paragraph> <Paragraph position="6"> The figure reveals an interesting behavior for different coverage ranges. In the high coverage range, Boost on basic+syntactic features attains the best performance. In the medium coverage area, the feature selection method for Dlist obtains the best results, also with basic+syntactic features. Finally, in the low coverage and high precision area, the decision-threshold method for Dlist is able to reach precisions in the high 90's, with no profit from syntactic features.</Paragraph> <Paragraph position="7"> The two methods to raise precision for Dlists are very effective. The decision-threshold method obtains a constant increase in precision, up to 93% precision with 7% coverage. The feature selection method attains 86% precision with 26% coverage using syntactic features; beyond that point there is no further improvement.</Paragraph> <Paragraph position="8"> In this case, Dlist is able to obtain good accuracy rates (at the cost of coverage) by restricting itself to the most reliable features. On the contrary, we have not succeeded in adjusting the AdaBoost algorithm to obtain high precision predictions. The figure also shows, for coverage above 20%, that the syntactic features allow for better results, confirming that syntactic features improve the results of the basic set. 7. Conclusions and further work.</Paragraph> <Paragraph position="9"> This paper shows that syntactic features effectively contribute to WSD precision. We have extracted syntactic relations using the Minipar parser, but the results should also be applicable to other parsers with similar performance. Two kinds of syntactic features are defined: Instantiated Grammatical Relations (IGR) between words, and Grammatical Relations (GR) coded as the presence of adjuncts/arguments in isolation or as subcategorization frames.</Paragraph> <Paragraph position="10"> The experiments were run on the Senseval-2 data, comparing two different ML algorithms (Dlist and Boost) trained both on a basic set of widely used features alone, and on a combination of basic and syntactic features. 
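The decision-list behavior and the decision-threshold method of section 6.3 can be sketched as follows. The features, senses and log-likelihood weights below are toy illustrations, not the paper's actual model:

```python
def dlist_predict(features, weighted_rules, threshold=0.0):
    """Decision list: rules are (feature, sense, weight) triples sorted
    by decreasing weight.  The first rule whose feature holds in the
    test example decides the sense; the classifier abstains (returns
    None) if no rule applies, or if the winning weight falls below the
    decision threshold -- raising the threshold trades coverage for
    precision."""
    for feature, sense, weight in sorted(weighted_rules, key=lambda r: -r[2]):
        if feature in features:
            return sense if weight >= threshold else None
    return None

def precision_coverage(predictions, gold):
    """Precision over answered examples; coverage = fraction answered."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    precision = sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    return precision, coverage

# Toy rules and test set for two hypothetical senses of "bank":
rules = [("river", "bank/shore", 2.3), ("money", "bank/institution", 1.1)]
test_set = [({"river"}, "bank/shore"),
            ({"money"}, "bank/institution"),
            ({"loan"}, "bank/institution")]
preds = [dlist_predict(feats, rules, threshold=2.0) for feats, _ in test_set]
p, c = precision_coverage(preds, [g for _, g in test_set])
```

With the threshold at 2.0 the weak "money" rule is discarded, so the toy system answers only one of three examples, correctly: precision rises to 100% while coverage drops to one third, the same trade-off the paper exploits.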
The main conclusions are the following: * IGR obtains better precision than GR, but the best precision/coverage combination (measured with F1) is attained by the combination of both.</Paragraph> <Paragraph position="11"> * Boost is able to profit from the addition of syntactic features, obtaining better results than Dlist. This proves that syntactic features contain information that is not present in other traditional features.</Paragraph> <Paragraph position="12"> * Overall, the improvement is around two points for Boost, with the highest increase for verbs.</Paragraph> <Paragraph position="13"> Several methods to exploit the precision-coverage trade-off were also tried: * The results show that syntactic features consistently improve the results on all data points except in the very low coverage range, confirming the contribution of syntax. * The results also show that Dlists are well suited to build a system with high precision: either a precision of 86% and a coverage of 26%, or 95% precision and 8% coverage.</Paragraph> <Paragraph position="14"> Regarding future work, a thorough analysis of the quality of each of the extracted syntactic relations should be performed. In addition, a word-by-word analysis would be interesting, as some words might profit from specific syntactic features, while others might not. A preliminary analysis has been performed in (Agirre &amp; Martinez, 2001b).</Paragraph> <Paragraph position="15"> Parsers other than Minipar could also be used. In particular, we found that Minipar always returns unambiguous trees, often making erroneous attachment decisions; a parser returning ambiguous output could be preferable. The results of this paper do not depend on the parser used, only on the quality of its output, which should be at least as good as Minipar's.</Paragraph> <Paragraph position="16"> Concerning the performance of the algorithm as compared to other Senseval-2 systems, it is not the best. 
Getting the best results was not the objective of this paper; rather, the objective was to show that syntactic features are worth including. We plan to improve the pre-processing of our systems, the detection of multiword lexical entries, etc., which could greatly improve the results. In addition, a number of factors could diminish or disguise the improvement in the results: hand-tagging errors, word senses missing from training or testing data, biased sense distributions, errors in syntactic relations, etc. Factoring out this &quot;noise&quot; could show the real extent of the contribution of syntactic features. On the other hand, we are using a high number of features. It is well known that many ML algorithms have problems scaling to high dimensional feature spaces, especially when the number of training examples is relatively low (as is the case for Senseval-2 word senses). Research on more careful feature selection (which depends on the ML algorithm) could also improve the contribution of syntactic features, and WSD results in general. In addition, alternative methods to produce a high precision system based on Boost need to be explored.</Paragraph> <Paragraph position="17"> Finally, the results on high precision WSD open the way for acquiring further examples in a bootstrapping framework.</Paragraph> </Section> </Paper>