<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1029"> <Title>Classifier Combination for Improved Lexical Disambiguation</Title>
<Section position="2" start_page="0" end_page="191" type="metho"> <SectionTitle> 1 Different Tagging Algorithms </SectionTitle>
<Paragraph position="0"> The experiments described in this paper are based on four popular tagging algorithms, all of which have readily available implementations. These taggers are described below. (See Dietterich (1997) for a good summary of these techniques.)</Paragraph>
<Section position="1" start_page="0" end_page="191" type="sub_section"> <SectionTitle> 1.1 Unigram Tagging </SectionTitle>
<Paragraph position="0"> This is by far the simplest of tagging algorithms. Every word is simply assigned its most likely part of speech, regardless of the context in which it appears. Surprisingly, this simple tagging method achieves fairly high accuracy; accuracies of 90-94% are typical. In the unigram tagger used in our experiments, for words that do not appear in the lexicon we use a collection of simple manually-derived heuristics to guess the proper tag for the word.</Paragraph> </Section>
<Section position="2" start_page="191" end_page="191" type="sub_section"> <SectionTitle> 1.2 N-Gram Tagging </SectionTitle>
<Paragraph position="0"> N-gram part-of-speech taggers (Bahl (1976), Church (1992), Weischedel (1993)) are perhaps the most widely used of tagging algorithms. The basic model is that given a word sequence W, we try to find the tag sequence T that maximizes P(T|W). This can be done using the Viterbi algorithm to find the T that maximizes P(T)*P(W|T). In our experiments, we use a standard trigram tagger with deleted interpolation (Jelinek (1980)) and suffix information for handling unseen words (as was done in Weischedel (1993)).</Paragraph> </Section>
<Section position="3" start_page="191" end_page="191" type="sub_section"> <SectionTitle> 1.3 Transformation-Based Tagging </SectionTitle>
<Paragraph position="0"> In transformation-based tagging (Brill (1995)), every word is first assigned an initial tag. This tag is the most likely tag for a word if the word is known, and is guessed based upon properties of the word if the word is not known. Then a sequence of rules is applied that change the tags of words based upon the contexts in which they appear. These rules are applied deterministically, in the order they appear in the list. As a simple example, if race appears in the corpus most frequently as a noun, it will initially be mistagged as a noun in the sentence: We can race all day long.</Paragraph>
<Paragraph position="1"> The rule Change a tag from NOUN to VERB if the previous tag is a MODAL would be applied to the sentence, resulting in the correct tagging. The environments used for changing a tag are the words and tags within a window of three words. For our experiments, we used a publicly available implementation of transformation-based tagging, retrained on our training set.</Paragraph> </Section>
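The following is a minimal sketch (not the authors' implementation) of the two-stage process just described: a unigram baseline assigns each word its most frequent tag from training counts (Section 1.1), and contextual rules are then applied deterministically, in list order (Section 1.3). All function names and the toy counts are illustrative.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_corpus):
    """Map each word to its most frequent tag; tagged_corpus is (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def apply_rules(tags, rules):
    """rules: (from_tag, to_tag, required_prev_tag) triples, applied in order."""
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

# Toy counts in which "race" is most often a noun, as in the example above.
lexicon = train_unigram([("We", "PRP"), ("can", "MD"), ("race", "NN"),
                         ("race", "NN"), ("race", "VB"), ("all", "DT"),
                         ("day", "NN"), ("long", "RB")])
sentence = ["We", "can", "race", "all", "day", "long"]
tags = [lexicon.get(w, "NN") for w in sentence]  # unknown words: a real system
                                                 # would use guessing heuristics
tags = apply_rules(tags, [("NN", "VB", "MD")])   # NOUN -> VERB after a MODAL
print(list(zip(sentence, tags)))                 # "race" is corrected to VB
```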
<Section position="4" start_page="191" end_page="191" type="sub_section"> <SectionTitle> 1.4 Maximum-Entropy Tagging </SectionTitle>
<Paragraph position="0"> The maximum-entropy framework is a probabilistic framework in which a model is found that is consistent with the observed data and is maximally agnostic with respect to all parameters for which no data exists. It is a nice framework for combining multiple constraints. Whereas the transformation-based tagger enforces multiple constraints by having multiple rules fire, the maximum-entropy tagger can have all of these constraints play a role in setting the probability estimates for the model's parameters. In Ratnaparkhi (1996), a maximum-entropy tagger is presented. The tagger uses essentially the same parameters as the transformation-based tagger, but employs them in a different model.</Paragraph>
<Paragraph position="1"> For our experiments, we used a publicly available implementation of maximum-entropy tagging, retrained on our training set.</Paragraph> </Section> </Section>
<Section position="3" start_page="191" end_page="192" type="metho"> <SectionTitle> 2 Tagger Complementarity </SectionTitle>
<Paragraph position="0"> All experiments presented in this paper were run on the Penn Treebank Wall Street Journal corpus (Marcus (1993)). The corpus was divided into approximately 80% training and 20% testing, giving us approximately 1.1 million words of training data and 265,000 words of test data. The test set was not used in any way in training, so the test set does contain unknown words.</Paragraph>
<Paragraph position="1"> In Figure 1 we show the relative accuracies of the four taggers. In parentheses we include tagger accuracy when only ambiguous and unknown words are considered. (Accuracy is often reported for ambiguity resolution over all words, including words that are unambiguous; correctly tagging words that can only have one label contributes to the accuracy. We see in Figure 1 that when accuracy is measured on truly ambiguous words, the numbers are lower. In this paper we stick to the convention of giving results for all words, including unambiguous ones.)</Paragraph>
<Paragraph position="2"> Next, we examine just how different the errors of the taggers are. We define the complementary rate of taggers A and B as:

Comp(A, B) = (1 - (# of common errors / # of errors in A)) * 100

In other words, Comp(A, B) measures the percentage of time when tagger A is wrong that tagger B is correct. In Figure 2 we show the complementary rates between the different taggers. For instance, when the Maximum Entropy tagger is wrong, the transformation-based tagger is right 37.7% of the time, and when the transformation-based tagger is wrong, the Maximum Entropy tagger is right 41.7% of the time.</Paragraph>
<Paragraph position="3"> The complementary rates are quite high, which is encouraging, since this sets the upper bound on how well we can do in combining the different classifiers. If all taggers made the same errors, or if the errors that lower-accuracy taggers made were merely a superset of higher-accuracy tagger errors, then combination would be futile.</Paragraph>
<Paragraph position="4"> In addition, a tagger is much more likely to have misclassified the tag for a word in instances where there is disagreement with at least one of the other classifiers than in the case where all classifiers agree.</Paragraph>
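As an illustration (not the paper's code), the complementary rate can be computed directly from per-token tagger outputs; the function and variable names below are assumed.

```python
def comp(tags_a, tags_b, gold):
    """Comp(A, B): of the tokens tagger A gets wrong, the % tagger B gets right."""
    errors_a = [i for i, (a, g) in enumerate(zip(tags_a, gold)) if a != g]
    if not errors_a:
        return 0.0
    common = sum(1 for i in errors_a if tags_b[i] != gold[i])  # both taggers wrong
    return (1 - common / len(errors_a)) * 100
```

Note that Comp is asymmetric: Comp(A, B) and Comp(B, A) generally differ, as with the 37.7% and 41.7% figures above.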
<Paragraph position="5"> In Figure 3 we see, for instance, that while the overall error rate for the Maximum Entropy tagger is 3.17%, in cases where there is disagreement between the four taggers the Maximum Entropy tagger error rate jumps to 27.1%. Discarding the unigram tagger, which is significantly less accurate than the others, when there is disagreement between the Maximum Entropy, Transformation-Based and Trigram taggers the Maximum Entropy tagger error rate jumps to 43.7%. These cases account for 58% of the total errors the Maximum Entropy tagger makes (4833/8400).</Paragraph>
<Paragraph position="6"> Next, we check whether tagger complementarity is additive. In Figure 4, the first row shows the additive error rate an oracle could achieve on the test set if it could pick between the different outputs of the taggers. For example, when the oracle can examine the output of the Maximum Entropy, Transformation-Based and Trigram taggers, it could achieve an error rate of 1.62%. The second row shows the additive error rate reduction the oracle could achieve. If the oracle is allowed to choose between all four taggers, a 55.5% error rate reduction is obtained over the Maximum Entropy tagger error rate. If the unigram output is discarded, the oracle improvement drops to 48.8% over the Maximum Entropy tagger error rate.</Paragraph>
<Paragraph position="7"> From these results, we can conclude that there is at least hope that improvements can be gained by combining the output of different taggers. We can also conclude that the improvements we expect are somewhat additive, meaning the more taggers we combine, the better the results we should expect.</Paragraph> </Section>
<Section position="4" start_page="192" end_page="194" type="metho"> <SectionTitle> 3 Tagger Combination </SectionTitle>
<Paragraph position="0"> The fact that the errors the taggers make are strongly complementary is very encouraging. If all taggers made the exact same errors, there would obviously be no chance of improving accuracy through classifier combination. However, note that a high complementary rate between tagger errors does not in itself imply that there is anything to be gained by classifier combination.</Paragraph>
<Paragraph position="1"> We ran experiments to determine whether the outputs of the different taggers could be effectively combined. We first explored combination via simple majority-wins voting. Next, we attempted to automatically acquire contextual cues that indicate both which tagger to believe in which contexts and which tags are indicated by different patterns of tagger outputs. Both the word environments and the tagger outputs for the word being tagged and its neighbors are used as cues for predicting the proper tag.</Paragraph>
<Section position="1" start_page="193" end_page="193" type="sub_section"> <SectionTitle> 3.1 Simple Voting </SectionTitle>
<Paragraph position="0"> The simplest combination scheme is to have the classifiers vote. The part of speech chosen by the largest number of classifiers is picked as the answer, with some method specified for breaking ties. We tried simple voting, using the Maximum Entropy, Transformation-Based and Trigram taggers. In case of ties (all taggers disagree), the Maximum Entropy tagger output is chosen, since this tagger had the highest overall accuracy (determined using a subset of the training set, not the test set). The results are shown in Figure 5.</Paragraph>
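A minimal sketch of this voting scheme (illustrative names, not the paper's code): with three voters, a token either has a majority tag or all three disagree, in which case the Maximum Entropy output is trusted.

```python
from collections import Counter

def vote(maxent_tag, tbl_tag, trigram_tag):
    """Majority vote over three taggers; ties go to the Maximum Entropy tagger."""
    tag, n = Counter([maxent_tag, tbl_tag, trigram_tag]).most_common(1)[0]
    return tag if n >= 2 else maxent_tag

print(vote("NN", "VB", "VB"))  # -> VB (majority wins)
print(vote("NN", "VB", "JJ"))  # -> NN (three-way tie, MaxEnt trusted)
```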
<Paragraph position="1"> Simple voting gives a net reduction in error of 6.9% over the best of the three taggers. This difference is significant at a >99% confidence level.</Paragraph> </Section>
<Section position="2" start_page="193" end_page="194" type="sub_section"> <SectionTitle> 3.2 Contextual Cues </SectionTitle>
<Paragraph position="0"> Next, we try to exploit the idiosyncrasies of the different taggers. Although the Maximum Entropy, Transformation-based and Trigram taggers use essentially the same types of contextual information for disambiguation, this information is exploited differently in each case. Our hope is that there is some regularity to these differences, which would then allow us to learn under what conditions we should trust one tagger's output over another's.</Paragraph>
<Paragraph position="1"> We used a version of example-based learning to determine whether these tagger differences could be exploited. (Example-based learning has also been applied successfully in building a single part-of-speech tagger (Daelemans (1996)).) To determine the tag of a word, we use the previous word, the current word, the next word, and the output of each tagger for the previous, current and next word. See Figure 6, which shows this context template of words and tagger outputs.</Paragraph>
<Paragraph position="2"> For each such context in the training set, we store the probabilities of which correct tags appeared in that context. When the tag distribution for a context has low entropy, it is a very good predictor of the correct tag when the identical environment occurs in unseen data.</Paragraph>
<Paragraph position="3"> The problem is that these environments are very specific, and will have low overall recall in a novel corpus. To account for this, we must back off to more general contexts when we encounter an environment in the test set that did not occur in the training set. This is done by specifying an order in which fields should be ignored until a match is found. The back-off ordering is learned automatically.</Paragraph>
<Paragraph position="4"> We ran two variants of this experiment. In the first case, given an instance in the test set, we find the most specific matching example in the training set, using the prespecified back-off ordering, and see what the most probable tag was in the training set for that environment. This is then chosen as the tag for the word. Note that this method is capable of learning to assign a tag that none of the taggers assigned. For instance, it could be the case that when the Unigram tagger thinks the tag should be X, and the Trigram and Maximum Entropy taggers think it should be Y, the true tag is most frequently Z.</Paragraph>
<Paragraph position="5"> In the second experiment, we use contexts to specify which tagger to trust, rather than which tag to output. Again the most specific context is found, but here we check which tagger has the highest probability of being correct in this particular context. For instance, we may learn that the Trigram tagger is most accurate at tagging the word up, or that the Unigram tagger does best at tagging the word race when the word that follows is and. The results are given in Figure 7.</Paragraph>
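The back-off lookup of the first variant might be implemented as below. This is a sketch under stated assumptions: the context is simplified to six fields (previous/current/next word plus one output per tagger for the current word only), and the back-off order is fixed by hand, whereas the paper learns it automatically.

```python
from collections import Counter, defaultdict

# Field indices to ignore at each back-off step, from most to least specific.
# Assumed for illustration; the paper learns this ordering automatically.
BACKOFF_ORDER = [(), (0,), (0, 2), (0, 2, 3)]

def mask(context, ignored):
    """Replace the ignored fields of a context tuple with None."""
    return tuple(None if i in ignored else f for i, f in enumerate(context))

def train(table, context, gold_tag):
    """Record the correct tag under every back-off view of the context."""
    for ignored in BACKOFF_ORDER:
        table[mask(context, ignored)][gold_tag] += 1

def predict(table, context, default_tag):
    """Return the most probable tag of the most specific matching context."""
    for ignored in BACKOFF_ORDER:
        dist = table.get(mask(context, ignored))
        if dist:
            return dist.most_common(1)[0][0]
    return default_tag  # nothing matched; fall back to, e.g., the MaxEnt output

# context = (prev_word, word, next_word, maxent, tbl, trigram) -- assumed layout
table = defaultdict(Counter)
train(table, ("can", "race", "all", "NN", "VB", "VB"), "VB")
print(predict(table, ("can", "race", "all", "NN", "VB", "VB"), "NN"))  # exact match -> VB
print(predict(table, ("may", "race", "all", "NN", "VB", "VB"), "NN"))  # backs off   -> VB
```

The second variant would store, for each context, which tagger was correct rather than which tag, and return the chosen tagger's output.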
<Paragraph position="6"> We see that while simple voting achieves an error reduction of 6.9%, using contexts to choose a tag gives an error reduction of 9.8%, and using contexts to choose a tagger gives an error reduction of 10.4%.</Paragraph>
<Paragraph position="7"> In this paper, we showed that the error distributions for three popular state-of-the-art part-of-speech taggers are highly complementary. Next, we described experiments demonstrating that we can exploit this complementarity to build a tagger that attains significantly higher accuracy than any of the individual taggers.</Paragraph>
<Paragraph position="8"> In the future, we plan to expand our repertoire of base taggers, to determine whether performance continues to improve as we add additional systems. We also plan to explore different methods for combining classifier outputs. We suspect that the features we have chosen to use for combination are not the optimal set of features. We need to carefully study the different algorithms to find possible cues that can indicate where a particular tagger performs well. We hope that by following these general directions, we can further exploit differences in classifiers to improve accuracy in lexical disambiguation.</Paragraph> </Section> </Section> </Paper>