<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2404"> <Title>Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Experimental Data </SectionTitle> <Paragraph position="0"> We conducted experiments using part of speech tagged and parsed versions of the SENSEVAL-2, SENSEVAL-1, line, hard, serve and interest data. The posSenseval and parseSenseval packages were used to part of speech tag and parse the data, respectively; posSenseval uses the Brill Tagger, while parseSenseval employs the Collins Parser. We used the training and test divisions that already exist in the SENSEVAL-2 and SENSEVAL-1 data. However, the line, hard, serve and interest data do not have a standard division, so we randomly split the instances into test (20%) and training (80%) portions.</Paragraph> <Paragraph position="1"> The SENSEVAL-2 and SENSEVAL-1 data were created for comparative word sense disambiguation exercises held in the summers of 2001 and 1998, respectively. The SENSEVAL-2 data consists of 4,328 test instances and 8,611 training instances and includes a total of 73 nouns, verbs and adjectives. The training data has the target words annotated with senses from WordNet. The target words have a varied number of senses, ranging from two for collaborate, graceful and solemn to 43 for turn.</Paragraph> <Paragraph position="2"> The SENSEVAL-1 data has 8,512 test and 13,276 training instances. The number of possible senses for these words ranges from 2 to 15, and the instances are tagged with senses from the Hector dictionary.</Paragraph> <Paragraph position="3"> The line data (Leacock, 1993) consists of 4,149 instances where the noun line is used in one of six possible WordNet senses. This data was extracted from the 1987-1989 Wall Street Journal (WSJ) corpus and the American Printing House for the Blind (APHB) corpus.
The distribution of senses is somewhat skewed: more than 50% of the instances are used in the product sense, while the remaining instances are more or less equally distributed among the other five senses.</Paragraph> <Paragraph position="4"> The hard data (Leacock, 1998) consists of 4,337 instances taken from the San Jose Mercury News (SJM) corpus, annotated with one of three senses of the adjective hard from WordNet. The distribution of instances is skewed, with almost 80% of the instances used in the not easy - difficult sense.</Paragraph> <Paragraph position="5"> The serve data (Leacock, 1998) consists of 5,131 instances with the verb serve as the target word. They are annotated with one of four senses from WordNet. Like line, it was created from the WSJ and APHB corpora.</Paragraph> <Paragraph position="6"> The interest data (Bruce, 1994) consists of 2,368 instances where the noun interest is used in one of six senses taken from the Longman Dictionary of Contemporary English (LDOCE). The instances are extracted from the part of speech tagged subset of the Penn Treebank Wall Street Journal Corpus (ACL/DCI version).</Paragraph> </Section> <Section position="5" start_page="1" end_page="2" type="metho"> <SectionTitle> 4 Experiments and Discussion </SectionTitle> <Paragraph position="0"> The SyntaLex word sense disambiguation package was used to carry out our experiments. It uses the C4.5 algorithm, as implemented by the J48 program in the Waikato Environment for Knowledge Analysis (Witten and Frank, 2000), to learn a decision tree for each word to be disambiguated. We use the majority classifier as a baseline point of comparison. This is a classifier that assigns all instances to the most frequent sense in the training data. Our system defaults to the majority classifier if it lacks any other recourse, and therefore it disambiguates all instances. We thus report our results in terms of accuracy.
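A majority classifier of the kind used as our baseline can be sketched as follows; this is a minimal illustration with invented sense labels, not the actual system:

```python
from collections import Counter

def majority_classifier(training_senses):
    """Learn the most frequent sense in the training data and
    assign it to every test instance, regardless of context."""
    majority_sense = Counter(training_senses).most_common(1)[0][0]
    return lambda instance: majority_sense

# Hypothetical sense-tagged training sample for the noun "line".
train = ["product", "product", "text", "product", "phone", "product"]
classify = majority_classifier(train)
print(classify("a line of clothes"))  # product -- every instance gets the majority sense
```

Because such a classifier labels every instance, its accuracy equals the relative frequency of the most common sense in the test data.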
Table 1 shows our overall experimental results, which will be discussed in the sections that follow. Note that the results of the majority classifier appear at the bottom of that table, and that the most accurate result for each set of data is shown in bold face.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 Lexical Features </SectionTitle> <Paragraph position="0"> We utilized the following lexical features in our experiments: the surface form of the target word, unigrams and bigrams. The entries under Lexical in Table 1 show disambiguation accuracy when using those features individually. It should be noted that the experiments for the SENSEVAL-2 and SENSEVAL-1 data using unigrams and bigrams are re-implementations of (Pedersen, 2001a), and that our results are comparable. However, the experiments on line, hard, serve and interest have been carried out for the first time.</Paragraph> <Paragraph position="1"> We observe that in general, the surface form does not improve significantly on the baseline results provided by the majority classifier. While in most of the data (the SENSEVAL-2, line, hard and serve data) there is hardly any improvement, we do see noticeable improvements in the SENSEVAL-1 and interest data. We believe that this is due to the nature of the feature. Certain words have many surface forms and senses, and in many such cases certain senses can be represented by a restricted subset of the possible surface forms. Such words are disambiguated better than others using this feature.</Paragraph> </Section> <Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 4.2 Part of Speech Features </SectionTitle> <Paragraph position="0"> We perform word sense disambiguation using individual part of speech features in order to compare the effect of single POS features against possibly more powerful combinations of part of speech features.
They are not expected to be powerful enough to do very good classification on their own, but may still capture certain intuitive notions. For example, it is very likely that if the noun line is preceded by a wh-word such as whose or which, it is used in the phone line sense. If the noun line is preceded by a preposition, say in or of, then there is a good chance that line has been used in the formation sense. The accuracies achieved by part of speech features on the SENSEVAL-2, SENSEVAL-1, line, hard, serve and interest data are shown in Table 1. The individual part of speech feature results are under POS, and the combinations under POS Combos.</Paragraph> <Paragraph position="1"> We observe that the individual part of speech features result in accuracies that are significantly better than the majority classifier for all the data except the line and hard data. Like the surface form, we believe that the part of speech features are more useful for disambiguating certain words than others. We show averaged results for the SENSEVAL-2 and SENSEVAL-1 data, and even there the part of speech features fare well. In addition, when looking at a more detailed breakdown of the 73 and 36 words included in these samples respectively, a considerable number of those words experience improved accuracy using part of speech features.</Paragraph> <Paragraph position="2"> In particular, we observed that while verbs and adjectives are disambiguated best by the part of speech of words one or two positions to their right (P1 and P2), nouns in general are aided by the part of speech of the immediately adjacent words on either side (P-1 and P1). In the case of transitive verbs (which are more frequent in this data than intransitive verbs), the words at positions P1 and P2 are usually the objects of the verb (for example, drink water). Similarly, an adjective is usually immediately followed by the noun which it qualifies (for example, short discussion). Thus, in the case of both verbs and adjectives, the word immediately following (P1) is likely to be a noun having a strong syntactic relation to the target. This explains the higher accuracies for verbs and adjectives using P1, and would imply high accuracies for nouns using P-1, which too we observe. However, we also observe high accuracies for nouns using P1. This can be explained by the fact that nouns are often the subjects of a sentence, and the words at P1 and P2 may be the syntactically related verbs, which aid in disambiguation.</Paragraph> <Paragraph position="12"> To summarize, verbs and adjectives are aided most by the parts of speech of the words to their right, and nouns by those of the words on either side; P1 is the most potent individual part of speech feature for disambiguating a set of noun, verb and adjective target words. A combination of the parts of speech of words surrounding (and possibly including) the target word may better capture the overall context than single part of speech features. Following is an example of how a combination of part of speech features may help identify the intended sense of the noun line. If the target word line is used in the plural form, is preceded by a personal pronoun, and the word following it is not a preposition, then it is likely that the intended sense is line of text, as in the actor forgot his lines or they read their lines slowly. However, if the word preceding line is a personal pronoun and the word following it is a preposition, then it is probably used in the product sense, as in, their line of clothes.
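A combination feature of the kind just described can be sketched as a joint encoding of the part of speech tags around the target word. This is a simplified illustration (Penn Treebank-style tags, with positions P-1, P1 and P2 denoting one word before and one and two words after the target), not the actual SyntaLex feature encoding:

```python
def pos_combo_features(tags, target_index, positions=(-1, 1, 2)):
    """Encode the POS tags at the given offsets from the target word,
    plus their joint value as a single combination feature suitable
    for a decision tree learner."""
    feats = {}
    for p in positions:
        i = target_index + p
        feats[f"P{p}"] = tags[i] if 0 <= i < len(tags) else "NONE"
    # The combination feature is the joint value of all window positions.
    feats["combo"] = "|".join(feats[f"P{p}"] for p in positions)
    return feats

# "he forgot his lines" -> PRP VBD PRP$ NNS, target "lines" at index 3
feats = pos_combo_features(["PRP", "VBD", "PRP$", "NNS"], 3)
print(feats["combo"])  # PRP$|NONE|NONE
```

A decision tree trained on the joint `combo` value can express rules like "possessive pronoun before, no preposition after", which the individual position features can only capture when combined along a path of the tree.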
POS Combos in Table 1 shows the accuracies achieved using such combinations with the SENSEVAL-2, SENSEVAL-1, line, hard, serve and interest data. Again, due to space constraints, we do not give a breakdown of the accuracies for the SENSEVAL-2 and SENSEVAL-1 data for the noun, verb and adjective target words.</Paragraph> <Paragraph position="13"> We note that decision trees based on binary features representing the possible values of a given sequence of part of speech tags outperform those based on individual features. The combinations which include P1 obtain higher accuracies. In the case of the verbs and adjectives in the SENSEVAL-2 and SENSEVAL-1 data, the best results are obtained using the parts of speech of words following the target word. The nouns are helped by the parts of speech of words on both sides. This is in accordance with the hypothesis that verbs and adjectives have strong syntactic relations to the words immediately following them, while nouns may have strong syntactic relations on either side. However, the hard and serve data are found to be helped by features from both sides. We believe this is because of the much larger number of instances per task in the case of the hard and serve data as compared to the adjectives and verbs in the SENSEVAL-1 and SENSEVAL-2 data. Due to the smaller amount of training data available for the SENSEVAL-2 and SENSEVAL-1 words, only the most potent features help.
The power of combining features is highlighted by the significant improvement of accuracies above the baseline for the line and hard data, which was not the case using individual features (Table 1).</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Parse Features </SectionTitle> <Paragraph position="0"> We employed the following parse features in these experiments: the head word of the phrase housing the target word, the type of phrase housing the target word (noun phrase, verb phrase, etc.), the head of the parent phrase, and the type of the parent phrase. These results are shown under Parse in Table 1.</Paragraph> <Paragraph position="1"> The head word feature yielded the best results in all the data except line, where the head of the parent phrase is most potent. Further, the nouns and adjectives benefit most from the head word feature. We believe this is the case because the head word is usually a content word and thus likely to be related to other nouns in the vicinity. Nouns are usually found in noun phrases or prepositional phrases.</Paragraph> <Paragraph position="2"> When part of a noun phrase, the noun is likely to be the head and thus does not benefit much from the head word feature. In such cases, the head of the parent phrase may prove to be more useful, as is the case in the line data.</Paragraph> <Paragraph position="3"> In the case of adjectives, the relation of the head word to the target word is expected to be even stronger, as it is likely to be the noun modified by the adjective (target word). The verb is most often found in a verb phrase and is usually the head word. Hence, verb target words are not expected to benefit from the head word feature, which is what we find here.
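The four parse features employed in this section can be sketched schematically; the Phrase class and the example fragment below are invented for illustration (the paper derives these features from Collins Parser output):

```python
class Phrase:
    """A toy constituency-tree node: phrase label, head word, optional parent."""
    def __init__(self, label, head, parent=None):
        self.label, self.head, self.parent = label, head, parent

def parse_features(phrase):
    """The four parse features for a target word housed in `phrase`."""
    return {
        "head_word":   phrase.head,
        "phrase_type": phrase.label,
        "parent_head": phrase.parent.head if phrase.parent else "NONE",
        "parent_type": phrase.parent.label if phrase.parent else "NONE",
    }

# Hypothetical fragment: target "interest" inside the NP "the interest rate",
# whose parent is a VP headed by "accrue".
vp = Phrase("VP", "accrue")
np = Phrase("NP", "rate", parent=vp)
print(parse_features(np))
```

Note how a noun target that is itself the head of its phrase gets little information from `head_word`, while `parent_head` (here, accrue) still carries a strong disambiguating signal.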
The phrase housing the target word and the parent phrase were not found to be beneficial when used individually.</Paragraph> <Paragraph position="4"> Certain parse features, such as the phrase type of the target word, take very few distinct values. For example, the target word shirt may occur in at most two distinct kinds of phrases: noun phrase and prepositional phrase. Such features are not expected to perform much better than the majority classifier. However, when used in combination with other features, they may be useful. Thus, as with the part of speech features, experiments were conducted using combinations of parse features in an effort to better capture the context and to identify sets of features which work well together. Consider the parse features head word and parent word. Head words such as magazine, situation and story are indicative of the quality of causing attention to be given sense of interest, while parent words such as accrue and equity are indicative of the interest rate sense. A classifier based on both features can confidently classify both kinds of instances. Table 1 has the results under Parse Combos. The Head and Head of Parent combinations have in general yielded significantly higher accuracies than simply the head word or any other parse feature used individually. The improvement is especially noteworthy in the case of the line, serve and interest data. The inclusion of other features along with these two does not help much more. We therefore find the Head and Head of Parent combination to be the most potent parse feature combination. It may be noted that a breakdown of accuracies (not shown here for the sake of brevity) for the noun, verb and adjective target words of the SENSEVAL-1 and SENSEVAL-2 data revealed that the adjectives were disambiguated best using the Head word and Phrase combination.
This is observed in the hard data results as well.</Paragraph> </Section> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Complementary/Redundant Features </SectionTitle> <Paragraph position="0"> As can be observed in the previous results, many different kinds of features can lead to roughly comparable word sense disambiguation results.</Paragraph> <Paragraph position="1"> Different types of features are expected to be redundant to a certain extent; in other words, the features will individually classify an identical subset of the instances correctly. Likewise, the features are expected to be complementary to some degree, that is, while one set of features correctly disambiguates a certain subset of the instances, another set of features correctly disambiguates an entirely distinct subset.</Paragraph> <Paragraph position="2"> The extent to which the feature sets are complementary and redundant justifies or obviates combining them. In order to accurately capture the amount of redundancy and complementarity between two feature sets, we introduce two measures: the Baseline Ensemble and the Optimal Ensemble. Consider the scenario where the outputs of two classifiers based on different feature sets are to be combined using a simple voting or ensemble technique for word sense disambiguation.</Paragraph> <Paragraph position="3"> The Baseline Ensemble is the accuracy attained by a hypothetical ensemble technique which correctly disambiguates an instance only when both classifiers identify the intended sense correctly. In effect, the Baseline Ensemble quantifies the redundancy between the two feature sets. The Optimal Ensemble is the accuracy of a hypothetical ensemble technique which correctly disambiguates an instance when either of the two classifiers identifies the intended sense.
We say that these are hypothetical in that they cannot be implemented; rather, they serve as post-disambiguation analysis techniques.</Paragraph> <Paragraph position="4"> Thus, the Optimal Ensemble is the upper bound on the accuracy achievable by combining the two feature sets using an ensemble technique. If the accuracies of the individual classifiers are X and Y, the Optimal Ensemble can be defined as follows:</Paragraph> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> OptimalEnsemble = (X - BaselineEnsemble) + (Y - BaselineEnsemble) + BaselineEnsemble </SectionTitle> <Paragraph position="0"> We use a simple ensemble technique to combine some of the best lexical and syntactic features identified in the previous sections. For each sense, the probabilities of it being the intended sense, as identified by the lexical and syntactic features, are summed; the sense which attains the highest score is chosen as the intended sense. Table 2 shows the best results achieved using this technique, along with the baseline and optimal ensembles, for the SENSEVAL-2, SENSEVAL-1, line, hard, serve and interest data. The table also presents the feature sets that achieved these results. In addition, the last column of this table shows representative values for some of the best results attained in the published literature for these data sets. Note that these are only approximate points of comparison, in that there are differences in how individual experiments are conducted for all of the non-SENSEVAL data.</Paragraph> <Paragraph position="1"> From the Baseline Ensemble we observe that there is a large amount of redundancy across the feature sets.
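The probability-sum ensemble and the two hypothetical measures can be sketched as follows; the sense labels and predictions are invented for illustration:

```python
def sum_ensemble(probs_a, probs_b):
    """Pick the sense with the highest summed probability across two classifiers."""
    senses = set(probs_a) | set(probs_b)
    return max(senses, key=lambda s: probs_a.get(s, 0.0) + probs_b.get(s, 0.0))

def baseline_ensemble(pred_a, pred_b, gold):
    """Accuracy of the hypothetical ensemble that is right only when BOTH are right."""
    return sum(a == g and b == g for a, b, g in zip(pred_a, pred_b, gold)) / len(gold)

def optimal_ensemble(pred_a, pred_b, gold):
    """Accuracy of the hypothetical ensemble that is right when EITHER is right."""
    return sum(a == g or b == g for a, b, g in zip(pred_a, pred_b, gold)) / len(gold)

gold   = ["s1", "s1", "s2", "s2"]
pred_a = ["s1", "s1", "s1", "s2"]  # classifier A accuracy X = 0.75
pred_b = ["s1", "s2", "s2", "s2"]  # classifier B accuracy Y = 0.75
B = baseline_ensemble(pred_a, pred_b, gold)  # 0.5
O = optimal_ensemble(pred_a, pred_b, gold)   # 1.0
# Consistent with OptimalEnsemble = (X - B) + (Y - B) + B:
assert abs(O - ((0.75 - B) + (0.75 - B) + B)) < 1e-9
```

In this toy example the two classifiers err on disjoint instances, so the Optimal Ensemble reaches 100% even though each classifier alone attains only 75%.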
That said, there is still a significant amount of complementarity, as may be noted from the difference between the Optimal Ensemble and the greater of the individual accuracies.</Paragraph> <Paragraph position="2"> For example, in the SENSEVAL-2 data, unigrams alone achieve 55.3% accuracy and part of speech features attain an accuracy of 54.6%. The Baseline Ensemble attains an accuracy of 43.6%, which means that this percentage of the test instances is correctly tagged, independently, by both the unigrams and the part of speech features. The unigrams get an additional 11.7% of the instances correct which the part of speech features tag incorrectly.</Paragraph> <Paragraph position="3"> Similarly, the part of speech features are able to correctly tag an additional 11% of the instances which are tagged erroneously when using only unigrams. The above values suggest a high amount of redundancy between the unigrams and part of speech features, but not so high as to suggest that there is no significant benefit in combining the two kinds of features. The difference between the Optimal Ensemble and the accuracy attained by unigrams is 12.6% (67.9% - 55.3%). This is a significant improvement in accuracy which may be achieved by a suitable ensemble technique. The difference is a quantification of the complementarity between the unigram and part of speech features based on this data. Further, we may conclude that given these unigram and part of speech features, the best ensemble techniques will not achieve accuracies higher than 67.9%.</Paragraph> <Paragraph position="4"> It may be noted that a single unified classifier based on multiple features may achieve accuracies higher than the Optimal Ensemble. However, we show that an accurate ensemble method (the Optimal Ensemble), based on simple lexical and syntactic features, achieves accuracies comparable to or better than some of the best previous results.
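The complementarity percentages quoted for the SENSEVAL-2 data follow directly from the reported accuracies, as a quick check confirms:

```python
X = 55.3  # unigram accuracy (%)
Y = 54.6  # part of speech accuracy (%)
B = 43.6  # Baseline Ensemble: instances both classifiers tag correctly (%)

print(round(X - B, 1))  # 11.7 -- instances only the unigrams get right
print(round(Y - B, 1))  # 11.0 -- instances only the part of speech features get right
```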
The point here is that using information from two distinct feature sets (lexical features and part of speech) could lead to state-of-the-art results. However, it is as yet unclear how to most effectively combine such simple classifiers to achieve these optimal results.</Paragraph> <Paragraph position="5"> Observation of the pairs of lexical and syntactic features which provide the highest accuracies for the various data suggests that part of speech combination features are likely to be most complementary with the lexical features (bigrams or unigrams).</Paragraph> <Paragraph position="6"> The hard data did particularly well with combinations of parse features, the Head and Parent words. The Optimal Ensemble attains an accuracy of over 91%, while the best previous results were approximately 83%. This indicates that the Head and Parent word features are not only very useful in disambiguating adjectives but are also a source of complementary information to lexical features.</Paragraph> </Section> </Paper>