<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1112"> <Title>Syntactic features for high precision Word Sense Disambiguation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Previous work. </SectionTitle> <Paragraph position="0"> Yarowsky (1994) defined a basic set of features that has been widely used (with some variations) by other WSD systems. It consisted on words appearing in a window of +-k positions around the target and bigrams and trigrams constructed with the target word. He used words, lemmas, coarse part-of-speech tags and special classes of words, such as &quot;Weekday&quot;. These features have been used by other approaches, with variations such as the size of the window, the distinction between open class/closed class words, or the pre-selection of significative words to look up in the context of the target word.</Paragraph> <Paragraph position="1"> Ng (1996) uses a basic set of features similar to those defined by Yarowsky, but they also use syntactic information: verb-object and subject-verb relations. The results obtained by the syntactic features are poor, and no analysis of the features or any reason for the low performance is given.</Paragraph> <Paragraph position="2"> Stetina et al. (1998) achieve good results with syntactic relations as features. They use a measure of semantic distance based on WordNet to find similar features. The features are extracted using a statistical parser (Collins, 1996), and consist of the head and modifiers of each phrase. Unfortunately, they do not provide a comparison with a baseline system that would only use basic features.</Paragraph> <Paragraph position="3"> The Senseval-2 workshop was held in Toulouse in July 2001 (Preiss & Yarowsky, 2001). Most of the supervised systems used only a basic set of local and topical features to train their ML systems. Regarding syntactic information, in the Japanese tasks, several groups relied on dependency trees to extract features that were used by different models (SVM, Bayes, or vector space models). For the English tasks, the team from the University of Sussex extracted selectional preferences based on subject-verb and verb-object relations. The John Hopkins team applied syntactic features obtained using simple heuristic patterns and regular expressions. Finally, WASP-bench used finite-state techniques to create a grammatical relation database, which was later used in the disambiguation process. The papers in the proceedings do not provide specific evaluation of the syntactic features, and it is difficult to derive whether they were really useful or not.</Paragraph> <Paragraph position="4"> 3. Basic feature set We have taken a basic feature set widely used in the literature, divided in topical features and local features (Agirre & Martinez, 2001b).</Paragraph> <Paragraph position="5"> Topical features correspond to open-class lemmas that appear in windows of different sizes around the target word. In this experiment, we used two different window-sizes: 4 lemmas around the target (coded as win_lem_4w), and the lemmas in the sentence plus the 2 previous and 2 following sentences (win_lem_2s).</Paragraph> <Paragraph position="6"> Local features include bigrams and trigrams (coded as big_, trig_ respectively) that contain the target word. An index (+1, -1, 0) is used to indicate the position of the target in the bigram or trigram, which can be formed by part of speech, lemmas or word forms (wf, lem, pos). 
<Paragraph position="7"> For instance, we could extract the following features for the target word known from the sample sentence below: the word form &quot;whole&quot; occurring in a 2-sentence window (win_wf_2s), the bigram &quot;known widely&quot; where the target is the last word (big_wf_+1), and the trigram &quot;RB RB N&quot; formed by the two PoS tags preceding the target word (trig_pos_+1).</Paragraph>
<Paragraph position="8"> &quot;There is nothing in the whole range of human experience more widely known and universally ...&quot;</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Set of Syntactic Features. </SectionTitle>
<Paragraph position="0"> In order to extract syntactic features from the tagged examples, we needed a parser that would meet the following requirements: free for research, able to provide the whole structure with named syntactic relations (in contrast to shallow parsers), positively evaluated on well-established corpora, domain independent, and fast enough.</Paragraph>
<Paragraph position="1"> Three parsers fulfilled all the requirements: Link Grammar (Sleator and Temperley, 1993), Minipar (Lin, 1993) and the parser of Carroll & Briscoe (2001). We installed the first two parsers and performed a set of small experiments (John Carroll helped out running his own parser).</Paragraph>
<Paragraph position="2"> Unfortunately, no comparative evaluation was available to help choose the best one. We performed a small comparative test, and all the parsers performed similarly. At this point we chose Minipar, mainly because it was fast, easy to install and its output could be easily processed. The choice of parser did not condition the design of the experiments (cf. section 7).</Paragraph>
<Paragraph position="3"> From the output of the parser, we extracted different sets of features. First, we distinguish between direct relations (words linked directly in the parse tree) and indirect relations (words that are two or more dependencies apart in the syntax tree, e.g. heads of prepositional modifiers of a verb). For example, from &quot;Henry was listed on the petition as the mayor's attorney&quot; a direct verb-object relation is extracted between listed and Henry, and the indirect relation &quot;head of a modifier prepositional phrase&quot; between listed and petition. For each relation we also store its inverse. The relations are coded according to the Minipar codes (cf. Appendix):
[Henry obj_word listed]
[listed objI_word Henry]
[petition mod_Prep_pcomp-n_N_word listed]
[listed mod_Prep_pcomp-n_NI_word petition]
For instance, in the last relation above, mod_Prep indicates that listed has some prepositional phrase attached, pcomp-n_N indicates that petition is the head of the prepositional phrase, I indicates that it is an inverse relation, and word that the relation holds between word forms (as opposed to relations between lemmas).</Paragraph>
<Paragraph position="4"> We distinguished two different kinds of syntactic relations: instantiated grammatical relations (IGR) and grammatical relations (GR).</Paragraph>
<Paragraph position="5"> 4.1. Instantiated Grammatical Relations IGRs are coded as [wordsense relation value] triples, where the value can be either the word form or the lemma.
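As an illustration of this coding, here is a minimal sketch of how such triples, including inverses and one-step indirect relations, could be read off a dependency parse. It is not the authors' code: the parse is modeled as a simplified list of (head, relation, dependent) links, and the lemma-level variants, the symmetric case where the target is a dependent, and the filtering of function-word nodes are all omitted.

```python
# Illustrative sketch (not the authors' implementation). A parse is modeled
# as a list of (head, relation, dependent) links in Minipar-style notation.

def igr_features(parse, target):
    """Collect [value relation target] triples for the target word, with
    inverse relations (suffix I) and one-step indirect relations (composed
    relation names), at the word-form level (_word)."""
    feats = []
    for head, rel, dep in parse:
        if head != target:
            continue
        feats.append([dep, rel + "_word", target])      # direct relation
        feats.append([target, rel + "I_word", dep])     # its inverse
        for head2, rel2, dep2 in parse:                 # one dependency further
            if head2 == dep:
                path = rel + "_" + rel2
                feats.append([dep2, path + "_word", target])
                feats.append([target, path + "I_word", dep2])
    # A real extractor would also skip intermediate nodes such as the
    # preposition itself and handle the case where the target is a dependent.
    return feats

parse = [("listed", "obj", "Henry"),
         ("listed", "mod_Prep", "on"),
         ("on", "pcomp-n_N", "petition")]
for f in igr_features(parse, "listed"):
    print(f)   # includes [petition mod_Prep_pcomp-n_N_word listed], etc.
```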
Some examples for the target noun &quot;church&quot; are shown below. In the first example, a direct relation is extracted for the &quot;building&quot; sense, and in the second example an indirect relation for the &quot;group of Christians&quot; sense.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2. Grammatical relations </SectionTitle>
<Paragraph position="0"> This kind of feature refers to the grammatical relations themselves. In this case, we collect bigrams [wordsense relation] and also n-grams [wordsense relation1 relation2 relation3 ...]. The relations can refer to any argument, adjunct or modifier. N-grams are similar to verbal subcategorization frames. At present, they have been used only for verbs. Minipar provides simple subcategorization information in the PoS itself, e.g. V_N_N for a verb taking two arguments. We have defined 3 types of n-grams:
* Ngram1: the subcategorization information included in the PoS data given by Minipar, e.g. V_N_N.</Paragraph>
<Paragraph position="1"> * Ngram2: the subcategorization information in ngram1, filtered by the arguments that actually occur in the sentence.</Paragraph>
<Paragraph position="2"> * Ngram3: all dependencies in the parse tree.</Paragraph>
<Paragraph position="3"> The three types have been explored in order to account for the argument/adjunct distinction, which Minipar does not always assign correctly. In the first case, Minipar's judgment is taken from the PoS. In the second case, the PoS and the relations deemed arguments are combined (adjuncts are hopefully filtered out, but some arguments might also be discarded). In the third, all relations (including adjuncts and arguments) are considered.</Paragraph>
<Paragraph position="4"> In the example below, the ngram1 feature indicates that the verb has two arguments (i.e. it is transitive), which is an error of Minipar probably caused by a gap in the lexicon. The ngram2 feature indicates simply that it has a subject and no object, and the ngram3 feature denotes the presence of the adverbial modifier &quot;still&quot;. Ngram2 and ngram3 try to repair possible gaps in Minipar's lexicon.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. ML algorithms. </SectionTitle>
<Paragraph position="0"> In order to measure the contribution of syntactic relations, we wanted to test them on several ML algorithms. At present we have chosen one algorithm that does not combine features (Decision Lists) and another that does combine features (AdaBoost).</Paragraph>
<Paragraph position="1"> Despite their simplicity, Decision Lists (Dlist for short) as defined in Yarowsky (1994) have been shown to be very effective for WSD (Kilgarriff & Palmer, 2000). Features are weighted with a log-likelihood measure and arranged in an ordered list according to their weight. In our case, the probabilities have been estimated using the maximum likelihood estimate, smoothed by adding a small constant (0.1) when probabilities are zero. Decisions taken with negative values were discarded (Agirre & Martinez, 2001b).</Paragraph>
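A minimal Decision List sketch in the spirit of this description follows. It is illustrative, not the authors' implementation: it uses one common generalization of Yarowsky's log-likelihood weight (each sense compared against the sum of its competitors given a feature), and the smoothing of zero counts by 0.1 stands in for the paper's smoothing of zero probabilities.

```python
import math
from collections import defaultdict

# Illustrative Decision List sketch (not the authors' implementation).

def train_dlist(examples, smooth=0.1):
    """examples: iterable of (features, sense) pairs, where features is a
    set of hashable feature identifiers. Returns rules sorted by weight."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = set()
    for feats, sense in examples:
        senses.add(sense)
        for f in feats:
            counts[f][sense] += 1.0
    rules = []
    for f, per_sense in counts.items():
        for s in senses:
            # Maximum likelihood counts, smoothed with a small constant
            # when they are zero.
            num = per_sense[s] if per_sense[s] > 0 else smooth
            den = sum(per_sense[s2] if per_sense[s2] > 0 else smooth
                      for s2 in senses if s2 != s)
            if den == 0:
                continue
            weight = math.log(num / den)
            if weight > 0:                 # discard negative evidence
                rules.append((weight, f, s))
    rules.sort(reverse=True)               # ordered list by decreasing weight
    return rules

def classify(rules, feats):
    for weight, f, s in rules:             # strongest matching rule decides
        if f in feats:
            return s, weight
    return None, 0.0                       # abstain when no rule applies

data = [({"big_wf_+1 widely known"}, "known#1"),
        ({"win_lem_4w range"}, "known#2")]
print(classify(train_dlist(data), {"win_lem_4w range"}))
```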
<Paragraph position="2"> AdaBoost (Boost for short) is a general method for obtaining a highly accurate classification rule by linearly combining many weak classifiers, each of which may be only moderately accurate (Freund, 1997). In these experiments, a generalized version of the Boost algorithm has been used (Schapire, 1999), which works with very simple domain-partitioning weak hypotheses (decision stumps) with confidence-rated predictions. This particular boosting algorithm is able to work efficiently in very high dimensional feature spaces, and has been applied, with significant success, to a number of NLP disambiguation tasks, including word sense disambiguation (Escudero et al., 2000). Regarding parametrization, the smoothing parameter has been set to the default value (Schapire, 1999), and Boost has been run for a fixed number of rounds (200) for each word. No optimization of these parameters has been done at the word level. When testing, the sense with the highest prediction is assigned.</Paragraph>
<Paragraph position="3"> 5.1. Precision vs. coverage trade-off.</Paragraph>
<Paragraph position="4"> A high-precision WSD system can be obtained at the cost of low coverage, by preventing the system from returning an answer in the lowest-confidence cases. We have tried two methods on Dlists, and one method on Boost.</Paragraph>
<Paragraph position="5"> The first method is based on a decision threshold (Dagan and Itai, 1994): the algorithm rejects decisions taken when the difference of the maximum likelihood among the competing senses is not large enough. For this purpose, a one-tailed confidence interval was created so that we could state with confidence 1 - α that the true value of the difference measure was bigger than a given threshold (named th). As in (Dagan and Itai, 1994), we adjusted the measure to the amount of evidence. Different values of th were tested, using a 60% confidence interval. The values of th range from 0 to 4. For more details see (Agirre and Martinez, 2001b).</Paragraph>
<Paragraph position="6"> The second method is based on feature selection (Agirre and Martinez, 2001a). Ten-fold cross-validation on the training data for each word was used to measure the precision of each feature in isolation. The ML algorithm is then applied using only the features with precision exceeding a given threshold. This method has the advantage of being able to set the desired precision of the final system.</Paragraph>
<Paragraph position="7"> In the case of Boost, there was no straightforward way to apply the first method.</Paragraph>
<Paragraph position="8"> The application of the second method did not yield satisfactory results, so we turned to using directly the support value returned for each decision. We first applied a threshold on this support value, i.e. discarding decisions made with low support values. A second approximation, which is the one reported here, applies a threshold over the difference in support between the winning sense and the second-ranked sense. Still, further work is needed in order to investigate how Boost could discard less-confident results.</Paragraph> </Section> </Paper>