<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2404"> <Title>Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="1" type="intro"> <SectionTitle> 2 Feature Space </SectionTitle> <Paragraph position="0"> We employ lexical and syntactic features in our word sense disambiguation experiments. The lexical features are unigrams, bigrams, and the surface form of the target word, while the syntactic features are part of speech tags and various components from a parse tree.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.1 Lexical Features </SectionTitle> <Paragraph position="0"> The surface form of a target word may restrict its possible senses. Consider the noun case, which has the surface forms case, cases and casing. These have the following senses: object of investigation, frame or covering, and a weird person. Given an occurrence of the surface form casing, we can immediately conclude that it was used in the sense of a frame or covering and not in either of the other two.</Paragraph> <Paragraph position="1"> Each surface form observed in the training data is represented as a binary feature that indicates whether or not that particular surface form occurs.</Paragraph> <Paragraph position="2"> Unigrams are individual words that appear in the text.</Paragraph> <Paragraph position="3"> Consider the following sentence: (2) the judge dismissed the case. Here the, judge, dismissed, the and case are unigrams.</Paragraph> <Paragraph position="4"> Both judge and dismissed suggest that case has been used in the judicial sense and not in the others. Every unigram that occurs above a certain frequency threshold in the training corpus is represented as a binary feature. 
For example, there is a feature that represents whether or not judge occurs in the context of a target word.</Paragraph> <Paragraph position="5"> Bigrams are pairs of words that occur in close proximity to each other, and in a particular order. For example, consider the following sentence: (3) the interest rate is lower in state banks. Here the interest, interest rate, rate is, is lower, lower in, in state and state banks are bigrams, and interest rate suggests that bank has been used in the financial institution sense and not the river bank sense. Every bigram that reaches given thresholds for frequency and for a measure of association score is represented as a binary feature. For example, the bigram feature interest rate has a value of 1 if it occurs in the context of the target word, and 0 if it does not.</Paragraph> <Paragraph position="6"> We use the Ngram Statistics Package to identify frequent unigrams and statistically significant bigrams in the training corpus for a particular word. However, unigrams or bigrams that occur commonly in text are ignored by specifying a stop list composed mainly of prepositions, articles and conjunctions.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Part of Speech Features </SectionTitle> <Paragraph position="0"> The parts of speech of words around the target word are also useful clues for disambiguation. It is likely that, when used in different senses, the target word will have markedly different configurations of parts of speech around it. 
The following sentences have the word turn in the changing sides/parties sense and the changing course/direction sense, respectively:</Paragraph> <Paragraph position="2"> Observe that the parts of speech following each occurrence of turn are significantly different, and that this distinction can be captured both by individual part of speech features and by combinations of them.</Paragraph> <Paragraph position="3"> The parts of speech of individual words at particular positions relative to the target word serve as features. The part of speech of the target word is P0. The parts of speech of the words following the target are denoted by P1, P2, etc. There is a binary feature for each part of speech tag observed in the training corpus at the given position or positions of interest.</Paragraph> <Paragraph position="4"> Suppose we would like to use part of speech features for the target word and one word to the right of the target. If the target word has 3 different parts of speech observed in the training data, and the word to the right (without regard to what that word is) has 32 different part of speech tags, then there will be 3 + 32 = 35 binary features that represent the occurrence of those tags at those positions.</Paragraph> <Paragraph position="5"> We also consider combinations of part of speech tags as features. These features are boolean, and indicate whether or not a particular sequence of part of speech tags has occurred at a given set of positions. In the scenario above, there would be 3 × 32 = 96 different binary features, each of which indicates whether a particular combination of values for the two positions of interest occurs.</Paragraph> </Section> <Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.3 Parse Features </SectionTitle> <Paragraph position="0"> A sentence is made up of multiple phrases and each phrase, in turn, is made up of phrases or words. 
Each phrase has a head word which may have strong syntactic relations with other words in the sentence. Consider the phrases her hard work and the hard surface. The head words work and surface are indicative of the "calling for stamina/endurance" and "not easily penetrable" senses of hard, respectively.</Paragraph> <Paragraph position="1"> Thus, the head word of the phrase housing the target word is used as a feature. The head word of its parent phrase is also suggestive of the intended sense of the target word. Consider the sentence fragments fasten the line and cross the line. The noun phrases (the line) have the verbs fasten and cross as the heads of their parent phrases. The verb fasten is indicative of the cord sense of line, while cross suggests the division sense.</Paragraph> <Paragraph position="2"> The types of the phrase housing the target word and of its parent phrase are also used as features. For example, one feature indicates that the phrase housing the target word is a noun phrase, another that the parent phrase is a verb phrase, and so on. As with the part of speech features, all parse features are boolean.</Paragraph> </Section> </Section> </Paper>
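<!-- Editorial illustration, not part of the original paper. The binary lexical features of Section 2.1 and the feature counts of Section 2.2 can be sketched roughly as follows. This is a minimal sketch under stated assumptions: all function names, the stop list contents, the tag names and the min_count threshold are hypothetical, bigrams are simplified to adjacent word pairs, and the measure of association test that the paper applies through the Ngram Statistics Package is omitted.

```python
from collections import Counter
from itertools import product

# Hypothetical stop list; the paper only says it is mainly prepositions,
# articles and conjunctions.
STOP_WORDS = {"the", "a", "in", "of", "and"}


def bigrams(tokens):
    """Return adjacent ordered word pairs (a simplification of 'close proximity')."""
    return list(zip(tokens, tokens[1:]))


def frequent_unigrams(contexts, min_count=2):
    """Keep non-stop-list words seen at least min_count times (assumed threshold)."""
    counts = Counter(w for ctx in contexts for w in ctx if w not in STOP_WORDS)
    return sorted(w for w, c in counts.items() if c >= min_count)


def binary_features(tokens, unigram_vocab, bigram_vocab):
    """Map each selected unigram or bigram to 1 if it occurs in the context, else 0."""
    seen_words = set(tokens)
    seen_pairs = set(bigrams(tokens))
    feats = {"U:" + w: int(w in seen_words) for w in unigram_vocab}
    feats.update({"B:" + a + "_" + b: int((a, b) in seen_pairs) for a, b in bigram_vocab})
    return feats


def pos_feature_counts(target_tags, right_tags):
    """Count individual tag features and two-position tag combination features."""
    individual = len(target_tags) + len(right_tags)
    combinations = len(list(product(target_tags, right_tags)))
    return individual, combinations


if __name__ == "__main__":
    context = "the interest rate is lower in state banks".split()
    print(binary_features(context, ["judge", "rate"], [("interest", "rate")]))
    # 3 tags for the target word and 32 for the word to its right, as in the text.
    print(pos_feature_counts(["NN", "VB", "VBP"], ["T%d" % i for i in range(32)]))
```

With 3 tags observed for the target word and 32 for the word to its right, pos_feature_counts returns the 35 individual and 96 combination features described in Section 2.2. A dictionary keyed by feature name is used here only for readability; a real system would more likely use a sparse vector. -->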