<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1004"> <Title>Modeling Consensus: Classifier Combination for Word Sense Disambiguation</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> Related work in classifier combination is discussed throughout this article. For the specific task of word sense disambiguation, the first empirical study was presented in Kilgarriff and Rosenzweig (2000), where the authors combined the output of the participating SENSEVAL1 systems via simple (non-weighted) voting, using either Absolute Majority, Relative Majority, or Unanimous voting. Stevenson and Wilks (2001) presented a classifier combination framework in which three disambiguation methods (simulated annealing, subject codes and selectional restrictions) were combined using the TiMBL memory-based approach (Daelemans et al., 1999).</Paragraph> <Paragraph position="1"> Pedersen (2000) presents experiments with an ensemble of Naive Bayes classifiers, which outperforms all previously published results on two ambiguous words (line and interest).</Paragraph> </Section>
<Section position="5" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 The WSD Feature Space </SectionTitle> <Paragraph position="0"> The feature space is a critical factor in classifier design, given the need to fuel the diverse strengths of the component classifiers. Thus its quality is often highly correlated with performance. For this reason, we used a rich feature space based on raw words, lemmas and part-of-speech (POS) tags in a variety of positional and syntactic relationships to the target word. These positions include traditional unordered bag-of-words context, local bigram and trigram collocations and several syntactic relationships based on predicate-argument structure. Their use is illustrated on a sample English sentence for the target word church in Figure 1.</Paragraph> <Paragraph position="1"> [Figure 1: feature types (word, POS, lemma) illustrated on the example sentence "An ancient stone church stands amid the fields, the sound of bells ..." for the SENSEVAL2 target word church.]</Paragraph> <Paragraph position="2"> While an extensive evaluation of the contribution of each feature type to WSD performance is beyond the scope of this paper, Section 6 sketches an analysis of the individual feature contribution to each of the classifier types.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Part-of-Speech Tagging and Lemmatization </SectionTitle> <Paragraph position="0"> Part-of-speech tagger availability varied across the languages studied here. An electronically available transformation-based POS tagger (Ngai and Florian, 2001) was trained on standard labeled data for English (Penn Treebank), Swedish (SUC1 corpus), and Basque. For Spanish, a minimally supervised tagger (Cucerzan and Yarowsky, 2000) was used. Lemmatization was performed using an existing trie-based supervised model for English, and a combination of supervised and unsupervised methods (Yarowsky and Wicentowski, 2000) for all the other languages.</Paragraph> </Section>
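To make the feature inventory above concrete, here is a minimal Python sketch of how such positional features could be collected for a single target occurrence. The (word, POS, lemma) token format, the window size, the offsets and the feature-name prefixes are illustrative assumptions, not the authors' implementation; the syntactic features of Section 3.2 would be appended analogously.

from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, lemma)

def extract_features(tokens: List[Token], target: int, window: int = 20) -> Dict[str, float]:
    feats: Dict[str, float] = {}
    # unordered bag-of-words / bag-of-lemmas context around the target word
    for i in range(max(0, target - window), min(len(tokens), target + window + 1)):
        if i == target:
            continue
        word, _pos, lemma = tokens[i]
        feats["bow_" + word.lower()] = 1.0
        feats["bol_" + lemma.lower()] = 1.0
    # local collocational features at fixed offsets (building blocks of the bigram/trigram features)
    for offset in (-2, -1, 1, 2):
        j = target + offset
        if j in range(len(tokens)):
            word, pos, lemma = tokens[j]
            feats["w_%+d_%s" % (offset, word.lower())] = 1.0
            feats["l_%+d_%s" % (offset, lemma.lower())] = 1.0
            feats["p_%+d_%s" % (offset, pos)] = 1.0
    return feats

# usage on the Figure 1 example fragment, target word "church"
sentence = [("An", "DT", "an"), ("ancient", "JJ", "ancient"), ("stone", "NN", "stone"),
            ("church", "NN", "church"), ("stands", "VBZ", "stand"), ("amid", "IN", "amid"),
            ("the", "DT", "the"), ("fields", "NNS", "field")]
print(sorted(extract_features(sentence, target=3).keys())[:6])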
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Syntactic Features </SectionTitle> <Paragraph position="0"> The syntactic features extracted for a target word depend on the word's part of speech: for verbs, the head noun of the verb's object, particle/preposition and prepositional object; for nouns, the headword of any verb-object, subject-verb or noun-noun relationship identified for the target word; for adjectives, the head noun modified by the adjective. The extraction process was performed using heuristic patterns and regular expressions over the parts-of-speech surrounding the target word; for the English data, feature extraction was performed by first identifying text chunks and then using heuristics on the chunks to extract the syntactic information.</Paragraph> </Section>
<Paragraph position="3"> The following subsections briefly introduce the six classifier models used in this study. Among these models, the Naive Bayes variants (NB henceforth) (Pedersen, 1998; Manning and Schutze, 1999) and Cosine differ slightly from off-the-shelf versions, and only the differences will be described.</Paragraph>
<Section position="3" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 4.1 Vector-based Models: Enhanced Naive Bayes and Cosine Models </SectionTitle> <Paragraph position="0"> Many of the systems used in this research share a common vector representation, which captures traditional bag-of-words, extended n-gram and predicate-argument features in a single data structure. In these models, a feature vector is created for each document $d$ in the collection, with one weighted component per feature $f_j$. Confusion between the same word participating in multiple feature roles is avoided by appending the feature values with their positional type (e.g. stands_Sbj and ancient_L are distinct from stands and ancient in unmarked bag-of-words context). The weight $w_j$ depends on the type of the feature $f_j$: for bag-of-words features, this weight is inversely proportional to the distance between the target word and the feature, while for predicate-argument and extended n-gram features it is an empirically estimated weight (on a per-language basis).</Paragraph> <Paragraph position="1"> The notable difference between the extended models and others described in the literature, aside from the use of more sophisticated features than the traditional bag-of-words, is the variable weighting of feature types noted above. These differences yield a boost in NB performance (relative to basic Naive Bayes) of between 3.5% (Basque) and 10% (Spanish), with an average improvement of 7.25% over the four languages.</Paragraph> </Section>
<Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 The BayesRatio Model </SectionTitle> <Paragraph position="0"> The BayesRatio model (BR henceforth) is a vector-based model using the likelihood ratio framework described in Gale et al. (1992), selecting $\hat{s} = \arg\max_{s} \prod_{f \in d} \frac{P(f|s)}{P(f|\bar{s})}$, where $\hat{s}$ is the selected sense, $d$ denotes documents and $f$ denotes features. By utilizing the binary ratio for k-way modeling of feature probabilities, this approach performs well on tasks where the data is sparse.</Paragraph> </Section>
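As a concrete (and simplified) illustration of the likelihood-ratio scoring just described, the sketch below scores each sense with smoothed ratios P(f|s)/P(f|s-bar) accumulated over weighted features. The count structures, the add-alpha smoothing constant and the treatment of the per-feature weights are assumptions made for this example; the paper's exact estimation details may differ.

import math
from collections import defaultdict

def train_counts(docs):
    """docs: list of (feature->weight dict, sense). Returns per-sense, global and sense-total counts."""
    per_sense = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    sense_totals = defaultdict(float)
    for feats, sense in docs:
        for f, w in feats.items():
            per_sense[sense][f] += w
            totals[f] += w
            sense_totals[sense] += w
    return per_sense, totals, sense_totals

def bayes_ratio(feats, per_sense, totals, sense_totals, alpha=0.1):
    grand_total = sum(sense_totals.values())
    scores = {}
    for s in per_sense:
        in_s = sense_totals[s]
        out_s = grand_total - in_s
        score = 0.0
        for f, w in feats.items():
            p_in = (per_sense[s][f] + alpha) / (in_s + alpha * len(totals))
            p_out = (totals[f] - per_sense[s][f] + alpha) / (out_s + alpha * len(totals))
            score += w * math.log(p_in / p_out)  # log of the binary ratio P(f|s)/P(f|s-bar)
        scores[s] = score
    return max(scores, key=scores.get)

# tiny hypothetical usage
docs = [({"bow_bell": 1.0, "p_+1_VBZ": 1.0}, "building"),
        ({"bow_service": 1.0}, "institution")]
model = train_counts(docs)
print(bayes_ratio({"bow_bell": 1.0}, *model))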
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 The MMVC Model </SectionTitle> <Paragraph position="0"> The Mixture Maximum Variance Correction classifier (MMVC henceforth) (Cucerzan and Yarowsky, 2002) is a two-step classifier. First, the sense probability is computed as a linear mixture, $P(s|d) = \sum_{w \in d} P(s|w)\, P(w|d)$, where the probability $P(s|w)$ is estimated from data and $P(w|d)$ is computed as a weighted, normalized similarity between the word $w$ and the target word $x$ (also taking into account the distance in the document between $w$ and $x$). In a second pass, the sense whose variance exceeds a theoretically motivated threshold is selected as the final sense label (for details, see Cucerzan and Yarowsky (2002)).</Paragraph> </Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.4 The Discriminative Models </SectionTitle> <Paragraph position="0"> Two discriminative models are used in the experiments presented in Section 5: a transformation-based learning system (TBL henceforth) (Brill, 1995; Ngai and Florian, 2001) and a non-hierarchical decision lists system (DL henceforth) (Yarowsky, 1996). For prediction, these systems utilize local n-grams around the target word (up to 3 words/lemmas/POS tags to the left/right), bag-of-word and bag-of-lemma collocations (within +/-20 words of the target word, grouped by different window sizes) and the syntactic features listed in Section 3.2.</Paragraph> <Paragraph position="1"> The TBL system was modified to include redundant rules that do not improve absolute accuracy on the training data under the traditional greedy training algorithm, but are nonetheless positively correlated with a particular sense. The benefit of this approach is that predictive but redundant features present in training contexts may appear by themselves in new test contexts, improving coverage and increasing the TBL base model performance by 1-2%.</Paragraph> </Section> </Section>
<Section position="8" start_page="2" end_page="5" type="metho"> <SectionTitle> 5 Models for Classifier Combination </SectionTitle> <Paragraph position="0"> One necessary property for success in combining classifiers is that the errors produced by the component classifiers should not be positively correlated. On one extreme, if the classifier outputs are strongly correlated, they will have a very high inter-agreement rate and there is little to be gained from the joint output. On the other extreme, Perrone and Cooper (1993) show that, if the errors made by the classifiers are uncorrelated and unbiased, then a classifier that selects the class maximizing the average posterior class probability reduces the mean squared error by a factor equal to the number of combined classifiers. This case is mostly of theoretical interest, since in practice all the classifiers will tend to make errors on the "harder" samples.</Paragraph> <Paragraph position="1"> Figure 3(a) shows the classifier inter-agreement among the six classifiers presented in Section 4, on the English data. Only two of them, BayesRatio and Cosine, have an agreement rate of over 80%, while the agreement rate can be as low as 63% (BayesRatio and TBL). The average agreement is 71.7%. The fact that the classifiers' outputs are not strongly correlated suggests that the differences in performance among them can be systematically exploited to improve the overall classification. All individual classifiers have high stand-alone performance (measured using 5-fold cross-validation on the training data); each is individually competitive with the best single SENSEVAL2 systems, and they are fortuitously diverse in relative performance, as shown in Table 3(b). A dendrogram of the similarity between the classifiers is shown in Figure 2, derived using maximum-linkage hierarchical agglomerative clustering.</Paragraph>
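The inter-agreement rates discussed above can be computed directly from the classifiers' hard outputs. The following is a small sketch with hypothetical classifier names and predictions, not the actual system outputs.

from itertools import combinations

def agreement(preds_a, preds_b):
    # fraction of test instances on which two classifiers emit the same sense label
    assert len(preds_a) == len(preds_b)
    same = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return same / len(preds_a)

predictions = {
    "NB":  ["s1", "s1", "s2", "s1"],
    "BR":  ["s1", "s2", "s2", "s1"],
    "TBL": ["s2", "s2", "s2", "s1"],
}
for (name_a, pa), (name_b, pb) in combinations(predictions.items(), 2):
    print(name_a, name_b, round(agreement(pa, pb), 3))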
<Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.1 Major Types of Classifier Combination </SectionTitle> <Paragraph position="0"> There are three major types of classifier combination (Xu et al., 1992). The most general type is the case where the classifiers output a posterior class probability distribution for each sample (which can be interpolated). In the second case, systems output only a set of labels, together with an ordering of preference (likelihood). In the third and most restrictive case, the classifications consist of just a single label, without rank or probability. Combining classifiers in each of these cases has different properties; the remainder of this section examines models appropriate to each situation.</Paragraph> </Section>
<Section position="2" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 5.2 Combining the Posterior Sense Probability Distributions </SectionTitle> <Paragraph position="0"> One of the simplest ways to combine the posterior probability distributions is via direct averaging, $P(s|x,d) = \frac{1}{N} \sum_{i=1}^{N} P_i(s|x,d)$ (Equation (1)), where $i$ ranges over the $N$ component classifiers. Surprisingly, this method obtains reasonably good results, despite its simplicity and the fact that it is not theoretically motivated under a Bayes framework. Its success is highly dependent on the condition that the classifiers' errors are uncorrelated (Tumer and Ghosh, 1995).</Paragraph> <Paragraph position="1"> The averaging method is a particular case of linear interpolation, $P(s|x,d) = \sum_{i=1}^{N} \lambda_i(x,d)\, P_i(s|x,d)$ (Equation (2)); when all coefficients $\lambda_i$ are equal to $\frac{1}{N}$ we obtain Equation (1).</Paragraph> <Paragraph position="2"> The mixture interpolation coefficients can be computed at different levels of granularity. For instance, one can make the assumption that $\lambda_i(x,d) = \lambda_i(x)$, in which case the coefficients are computed at word level; if $\lambda_i(x,d) = \lambda_i$ then the coefficients are estimated on the entire data.</Paragraph> <Paragraph position="3"> One way to estimate these parameters is by linear regression (Fuhr, 1989): estimate the coefficients that minimize the mean square error (MSE), $\sum_{(x,d)} \big\| \sum_i \lambda_i(x,d)\, P_i(\cdot|x,d) - t(x,d) \big\|^2$ (Equation (3)), where $t(x,d)$ is the target vector of the correct classification of word $x$ in document $d$. Note that we are computing a probability conditioned both on the target word $x$ and the document $d$, because the documents are associated with a particular target word $x$; this formalization works mainly for the lexical choice task.</Paragraph> <Paragraph position="5"> As shown in Fuhr (1989) and Perrone and Cooper (1993), the solution to the optimization problem in Equation (3) can be obtained by solving a linear system of equations.</Paragraph> <Paragraph position="6"> The resulting classifier will have a lower squared error than the averaging classifier (since the averaging classifier is a particular case of the weighted mixture).</Paragraph> <Paragraph position="7"> Another common method to compute the $\lambda$ parameters is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). One can estimate the coefficients so as to maximize the log-likelihood of the data, $L = \sum_{(x,d)} \log \sum_i \lambda_i(x,d)\, P_i(s_{x,d}|x,d)$, where $s_{x,d}$ is the correct sense of the target word $x$ in document $d$; for this maximization problem the search space is convex, therefore a solution exists and is unique, and it can be obtained by the usual EM algorithm (see Berger (1996) for a detailed description).</Paragraph>
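As one possible realization of the EM estimation just described, the sketch below fits a single global set of weights lambda_i on held-out posteriors. The data, the number of iterations and the global (rather than per-word) granularity are assumptions made for illustration.

def em_weights(posteriors, gold, iterations=50):
    """posteriors: list over samples of list over classifiers of sense->prob dicts; gold: correct senses."""
    n_clf = len(posteriors[0])
    lam = [1.0 / n_clf] * n_clf
    for _ in range(iterations):
        resp_sums = [0.0] * n_clf
        for dists, s in zip(posteriors, gold):
            # E-step: responsibility of each classifier for the correct sense
            contrib = [lam[i] * dists[i].get(s, 0.0) for i in range(n_clf)]
            z = sum(contrib) or 1e-12
            for i in range(n_clf):
                resp_sums[i] += contrib[i] / z
        # M-step: renormalize responsibilities into new mixture weights
        lam = [r / len(posteriors) for r in resp_sums]
    return lam

posteriors = [
    [{"s1": 0.7, "s2": 0.3}, {"s1": 0.4, "s2": 0.6}],
    [{"s1": 0.2, "s2": 0.8}, {"s1": 0.1, "s2": 0.9}],
]
gold = ["s1", "s2"]
print(em_weights(posteriors, gold))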
<Paragraph position="8"> An alternative method for estimating the parameters is to make each $\lambda_i$ proportional to the probability that classifier $i$ classifies a sample correctly (Equation (4)), therefore giving more weight to classifiers that have a smaller classification error (this method will be referred to as PB). The probabilities in Equation (4) are estimated directly from data, using the maximum likelihood principle.</Paragraph> </Section>
<Section position="3" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.3 Combination based on Order Statistics </SectionTitle> <Paragraph position="0"> In cases where there are reasons to believe that the posterior probability distribution output by a classifier is poorly estimated, but that the relative ordering of senses matches the truth, a combination strategy based on the relative ranking of sense posterior probabilities is more appropriate. (For instance, in sparse classification spaces the Naive Bayes classifier will assign a probability very close to 1 to the most likely sense, and close to 0 for the other ones.) The sense posterior probability can be computed as $P(s|x,d) \propto \sum_i rank_i(s)$, where the rank of a sense $s$ under classifier $i$ is inversely proportional to the number of senses that are (strictly) more probable than sense $s$: $rank_i(s) = (1 + m_i(s))^{-1}$, with $m_i(s)$ the number of senses that classifier $i$ considers strictly more probable than $s$. This method will tend to prefer senses that appear closer to the top of the likelihood list for most of the classifiers, therefore being more robust both in cases where one classifier makes a large error and in cases where some classifiers consistently overestimate the posterior sense probability of the most likely sense.</Paragraph> </Section>
<Section position="4" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 5.4 The Classifier Republic: Voting </SectionTitle> <Paragraph position="0"> Some classification methods frequently used in NLP directly minimize the classification error and do not usually provide a probability distribution over classes/senses (e.g. TBL and decision lists). There are also situations where the user does not have access to the probability distribution, such as when the available classifier is a black box that only outputs the best classification. A very common technique for combination in such a case is voting (Brill and Wu, 1998; van Halteren et al., 1998; Sang et al., 2000). In the simplest model, each classifier votes for its classification and the sense that receives the most votes wins. The behavior is identical to selecting the sense with the highest posterior probability, computed as $P(s|x,d) \propto \sum_i \lambda_i\, \delta(s, \hat{s}_i(x,d))$, where $\delta$ is the Kronecker delta and $\hat{s}_i(x,d)$ is the sense chosen by classifier $i$. The coefficients can be either equal (in a perfect classifier democracy), or they can be estimated with any of the techniques presented in Section 5.2. Section 6 presents an empirical evaluation of these techniques.</Paragraph> <Paragraph position="1"> Van Halteren et al. (1998) introduce a modified version of voting called TagPair. Under this model, the conditional probability that the word sense is $s$ given that classifier $i$ outputs $s_i$ is estimated from the training data; each classifier votes for its classification and every pair of classifiers votes for the sense that is most likely given their joint classification. In the experiments presented in van Halteren et al. (1998), this method was the best performer among the presented methods. Van Halteren et al. (2001) extend this method to arbitrarily long conditioning sequences, obtaining the best published POS tagging results on four corpora.</Paragraph> </Section> </Section>
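The rank-based strategy of Section 5.3 is straightforward to implement from the classifiers' posterior distributions. Below is a small sketch using reciprocal ranks and optional classifier weights; the posteriors are hypothetical.

def rank_combine(posteriors, weights=None):
    """posteriors: list over classifiers of sense->prob dicts; returns the rank-preferred sense."""
    weights = weights or [1.0] * len(posteriors)
    senses = set()
    for dist in posteriors:
        senses.update(dist)
    scores = dict.fromkeys(senses, 0.0)
    for dist, w in zip(posteriors, weights):
        for s in senses:
            p = dist.get(s, 0.0)
            # number of senses this classifier considers strictly more probable than s
            more_probable = sum(1 for q in dist.values() if q > p)
            scores[s] += w / (1.0 + more_probable)
    return max(scores, key=scores.get)

print(rank_combine([{"s1": 0.99, "s2": 0.01},
                    {"s2": 0.60, "s1": 0.30, "s3": 0.10},
                    {"s1": 0.50, "s2": 0.30, "s3": 0.20}]))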
<Section position="9" start_page="5" end_page="9" type="metho"> <SectionTitle> 6 Empirical Evaluation </SectionTitle> <Paragraph position="0"> To empirically test the combination methods presented in the previous section, we ran experiments on the SENSEVAL1 English data and data from four SENSEVAL2 lexical sample tasks: English (EN), Spanish (ES), Basque (EU) and Swedish (SV). Unless explicitly stated otherwise, all the results in the following section were obtained by performing 5-fold cross-validation. To avoid the potential for over-optimization, a single final evaluation system was run once on the otherwise untouched test data, as presented in Section 6.3.</Paragraph> <Paragraph position="1"> The data consists of contexts associated with a specific word to be sense-tagged (the target word); the context size varies from 1 sentence (Spanish) to 5 sentences (English, Swedish). Table 1 presents some statistics collected on the training data for the five data sets. Some of the tasks are quite challenging (e.g. the SENSEVAL2 English task), as illustrated by the mean participating systems' accuracies in Table 5.</Paragraph> <Paragraph position="2"> Underscoring the claim that feature selection is important for WSD, Table 2 presents the marginal loss in performance from either only using one of the positional feature classes or excluding one of the positional feature classes, relative to the algorithm's full performance using all available feature classes. It is interesting to note that the feature-attractive methods (NB, BR, Cosine) depend heavily on the BagOfWords features, while the discriminative methods are most dependent on the LocalContext features. For an extensive evaluation of factors influencing WSD performance (including representational features), we refer the readers to Yarowsky and Florian (2002).</Paragraph>
<Section position="1" start_page="6" end_page="7" type="sub_section"> <SectionTitle> 6.1 Combination Performance </SectionTitle> <Paragraph position="0"> Table 3 shows the fine-grained sense accuracy (percentage of exactly correct senses) obtained by running the classifier combination methods on 5 classifiers, NB (Naive Bayes), BR (BayesRatio), TBL, DL and MMVC, including the average classifier accuracy and the best classifier accuracy. When parameters needed to be estimated, a 3-1-1 split was used: the systems were trained on three parts, parameters were estimated on the fourth (in a round-robin fashion) and performance was tested on the fifth; special care was taken such that no "test" data was used in training classifiers or parameter estimation. In Table 3, marked entries indicate that the difference in performance was not statistically significant at the 0.01 level (paired McNemar test).</Paragraph> <Paragraph position="1"> Before examining the results, it is worth mentioning that the methods which estimate parameters are doing so on a smaller training size (3/5, to be precise), and this can have an effect on how well the parameters are estimated. After the parameters are estimated, however, the interpolation is done between probability distributions that are computed on 4/5 of the training data, similarly to the methods that do not estimate any parameters.</Paragraph>
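A possible realization of the 3-1-1 protocol described above is sketched below; the exact rotation of the parameter-estimation fold is an assumption, since only the 3/1/1 roles of the folds are specified.

def three_one_one_splits(n_folds=5):
    # within each cross-validation round: 3 folds train the base classifiers,
    # 1 fold (round-robin) estimates combination parameters, 1 fold is held out for testing
    folds = list(range(n_folds))
    for test in folds:
        for dev in folds:
            if dev == test:
                continue
            train = [f for f in folds if f not in (test, dev)]
            yield train, dev, test

for train, dev, test in list(three_one_one_splits())[:4]:
    print("train on folds", train, "estimate params on", dev, "test on", test)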
<Paragraph position="2"> The unweighted averaging model of probability interpolation (Equation (1)) performs well, obtaining over 1% mean absolute improvement over the best classifier; the difference in performance is statistically significant in all cases except Swedish and Spanish. Of the classifier combination techniques, rank-based combination and performance-based voting perform best. Their mean 2% absolute improvement over the single best classifier is significant in all languages. Also, their accuracy improvement relative to uniform-weight probability interpolation is statistically significant in aggregate and for all languages except Basque (where there is generally a small difference among all classifiers).</Paragraph> <Paragraph position="3"> To ensure that we benefit from the performance improvement of each of the stronger combination methods and also to increase robustness, a final averaging method is applied to the output of the best-performing combiners (creating a stacked classifier). The last line in Table 3 shows the results obtained by averaging the output of the rank-based, EM-vote and PB-vote methods. (The best individual classifier differs with language, as shown in Figure 3(b).) The difference in performance between the stacked classifier and the best classifier is statistically significant for all data sets at a significance level of at least $10^{-5}$, as measured by a paired McNemar test.</Paragraph> <Paragraph position="4"> [Table caption fragments: "classifiers: NB, BR, TBL, DL, MMVC. Best performing methods are shown in bold."; "interpolation models for SENSEVAL2".]</Paragraph> <Paragraph position="6"> One interesting observation is that for all methods of $\lambda$-parameter estimation (EM, PB and uniform weighting), the count-based and rank-based strategies that ignore relative probability magnitudes outperform their equivalent combination models using probability interpolation. This is especially the case when the base classifier scores have substantially different ranges or variances; using relative ranks effectively normalizes for such differences in model behavior.</Paragraph> <Paragraph position="7"> For the three methods that estimate the interpolation weights (MSE, EM and PB), three variants were investigated, distinguished by the granularity at which the weights are estimated: at word level ($\lambda_i(x)$), [...]. [Table caption fragment: "obtained by estimating the parameters using EM at different sample granularities for the SENSEVAL2 English data. The number in the last column is obtained by interpolating the first three systems. Also displayed is cross-entropy, a measure of how well ..."]</Paragraph> </Section>
<Section position="2" start_page="7" end_page="8" type="sub_section"> <SectionTitle> 6.2 Individual Systems' Contribution to Combination </SectionTitle> <Paragraph position="0"> An interesting issue pertaining to classifier combination is the marginal contribution of each individual classifier to the final combined performance. A suitable measure of this contribution is the difference in performance between a combination system's behavior with and without the particular classifier. The more negative the accuracy difference on omission, the more valuable the classifier is to the ensemble system.</Paragraph>
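The marginal-contribution measure described above amounts to a leave-one-classifier-out ablation. The sketch below illustrates it with a simple majority-vote combiner and hypothetical predictions; the paper's combiners are the weighted methods of Section 5.

from collections import Counter

def majority_vote(predictions):
    """predictions: dict classifier_name -> list of predicted senses (same length)."""
    n = len(next(iter(predictions.values())))
    combined = []
    for j in range(n):
        votes = Counter(preds[j] for preds in predictions.values())
        combined.append(votes.most_common(1)[0][0])
    return combined

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

predictions = {
    "NB":   ["s1", "s2", "s2", "s1"],
    "BR":   ["s1", "s2", "s1", "s1"],
    "TBL":  ["s2", "s2", "s1", "s1"],
    "DL":   ["s1", "s1", "s1", "s2"],
    "MMVC": ["s1", "s2", "s1", "s1"],
}
gold = ["s1", "s2", "s1", "s1"]

full = accuracy(majority_vote(predictions), gold)
for name in predictions:
    rest = {k: v for k, v in predictions.items() if k != name}
    # negative values mean the ensemble loses accuracy when this classifier is omitted
    drop = accuracy(majority_vote(rest), gold) - full
    print(name, round(drop, 3))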
<Paragraph position="1"> Figure 4(a) displays the drop in performance obtained by eliminating each classifier in turn from the 6-way combination, across the four languages, while Figure 4(b) shows the contribution of each classifier on the SENSEVAL2 English data for different training sizes (10%-80%); the latter graph is obtained by repeatedly sampling a prespecified ratio of training samples from 3 of the 5 cross-validation splits and testing on the other 2. Note that the classifiers with the greatest marginal contribution to the combined system performance are not always the best single performing classifiers (Table 3(b)), but those with the most effective original exploitation of the common feature space. On average, the classifier that contributes the most to the combined system's performance is the TBL classifier, with an average improvement of 0.66% across the 4 languages. Also, note that TBL and DL offer the greatest marginal contribution on smaller training sizes (Figure 4(b)).</Paragraph> </Section>
<Section position="3" start_page="8" end_page="9" type="sub_section"> <SectionTitle> 6.3 Performance on Test Data </SectionTitle> <Paragraph position="0"> At all points in this article, experiments have been based strictly on the original SENSEVAL1 and SENSEVAL2 training sets via cross-validation. The official SENSEVAL1 and SENSEVAL2 test sets were unused and unexamined during experimentation, to avoid any possibility of indirect optimization on this data. But to provide results more readily comparable to the official benchmarks, a single consensus system was created for each language using linear average stacking on the top three classifier combination methods in Table 3, for conservative robustness. The final frozen consensus system for each language was applied once to the SENSEVAL test sets. The fine-grained results are shown in Table 5. For each language, the single new stacked combination system outperforms the best previously reported SENSEVAL results on the identical test data.</Paragraph> <Paragraph position="1"> As far as we know, they represent the best published results for any of these five SENSEVAL tasks.</Paragraph> </Section> </Section> </Paper>