<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0213"> <Title>i A Maximum Entropy Model for Part-Of-Speech Tagging</Title> <Section position="2" start_page="0" end_page="140" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper presents a statistical model which trains from a corpus annotated with Part-Of-Speech tags and assigns them to previously unseen text with state-of-the-art accuracy(96.6%). The model can be classified as a Maximum Entropy model and simultaneously uses many contextual &quot;features&quot; to predict the POS tag. Furthermore, this paper demonstrates the use of specialized features to model difficult tagging decisions, discusses the corpus consistency problems discovered during the implementation of these features, and proposes a training strategy that mitigates these problems.</Paragraph> <Paragraph position="1"> Introduction Many natural language tasks require the accurate assignment of Part-Of-Speech (POS) tags to previously unseen text. Due to the availability of large corpora which have been manually annotated with POS information, many taggers use annotated text to &quot;learn&quot; either probability distributions or rules and use them to automatically assign POS tags to unseen text.</Paragraph> <Paragraph position="2"> The experiments in this paper were conducted on the Wall Street Journal corpus from the Penn Treebank project(Marcus et al., 1994), although the model can trai~n from any large corpus annotated with POS tags. Since most realistic natural language applications must process words that were never seen before in training data, all experiments in this paper are conducted on test data that include unknown words.</Paragraph> <Paragraph position="3"> Several recent papers(Brill, 1994, Magerman, 1995) have reported 96.5% tagging accuracy on the Wall St. Journal corpus.</Paragraph> <Paragraph position="4"> The experiments in this paper test the hypothesis that better use of context will improve the accuracy. A Maximum Entropy model is well-suited for such experiments since it cornbines diverse forms of contextual information in a principled manner, and does not impose any distributional assumptions on the training data. Previous uses of this model include language modeling(Lau et al., 1993), machine translation(Berger et al., 1996), prepositional phrase attachment(Ratnaparkhi et al., 1994), and word morphology(Della Pietra et al., 1995). This paper briefly describes the maximum entropy and maximum likelihood properties of the model, features used for POS tagging, and the experiments on the Penn Treebank Wall St. Journal corpus. It then discusses the consistency problems discovered during an attempt to use specialized features on the word context. Lastly, the results in this paper are compared to those from previous work on POS tagging.</Paragraph> <Paragraph position="5"> The Probability Model The probability model is defined over 7-/x 7-, where 7t is the set of possible word and tag contexts, or &quot;histories&quot;, and T is the set of allowable tags. The model's probability of a history h together with a tag t is defined as:</Paragraph> <Paragraph position="7"> where ~&quot; is a normalization constant, {tt, ~1,..., ak} are the positive model parameters and {fl,..-,fk} are known as &quot;features&quot;, where fj(h,t) E {O, 1}. 
<Paragraph position="8"> Given a sequence of words {w_1, ..., w_n} and tags {t_1, ..., t_n} as training data, define h_i as the history available when predicting t_i. The parameters {μ, α_1, ..., α_k} are then chosen to maximize the likelihood of the training data using p:</Paragraph> <Paragraph position="9"> L(p) = ∏_{i=1}^{n} p(h_i, t_i) = ∏_{i=1}^{n} πμ ∏_{j=1}^{k} α_j^{f_j(h_i, t_i)}</Paragraph> <Paragraph position="10"> This model can also be interpreted under the Maximum Entropy formalism, in which the goal is to maximize the entropy of a distribution subject to certain constraints. Here, the entropy of the distribution p is defined as:</Paragraph> <Paragraph position="11"> H(p) = − Σ_{h ∈ H, t ∈ T} p(h, t) log p(h, t), and the constraints are given by E f_j = Ẽ f_j, 1 ≤ j ≤ k   (2)</Paragraph> <Paragraph position="12"> where the model's feature expectation is</Paragraph> <Paragraph position="13"> E f_j = Σ_{h ∈ H, t ∈ T} p(h, t) f_j(h, t)</Paragraph> <Paragraph position="14"> and the observed feature expectation is Ẽ f_j = Σ_{i=1}^{n} p̃(h_i, t_i) f_j(h_i, t_i), where p̃(h_i, t_i) denotes the observed probability of (h_i, t_i) in the training data. Thus the constraints force the model to match its feature expectations with those observed in the training data. In practice, H is very large and the model's expectation E f_j cannot be computed directly, so the following approximation (Lau et al., 1993) is used:</Paragraph> <Paragraph position="15"> E f_j ≈ Σ_{i=1}^{n} p̃(h_i) p(t_i | h_i) f_j(h_i, t_i)</Paragraph> <Paragraph position="16"> where p̃(h_i) is the observed probability of the history h_i in the training set.</Paragraph> <Paragraph position="17"> It can be shown (Darroch and Ratcliff, 1972) that if p has the form (1) and satisfies the k constraints (2), it uniquely maximizes the entropy H(p) over distributions that satisfy (2), and uniquely maximizes the likelihood L(p) over distributions of the form (1). The model parameters for the distribution p are obtained via Generalized Iterative Scaling (Darroch and Ratcliff, 1972).</Paragraph> <Section position="1" start_page="133" end_page="136" type="sub_section"> <SectionTitle> Features for POS Tagging </SectionTitle> <Paragraph position="0"> The joint probability of a history h and tag t is determined by those parameters whose corresponding features are active, i.e., those α_j such that f_j(h, t) = 1. A feature, given (h, t), may activate on any word or tag in the history h, and must encode any information that might help predict t, such as the spelling of the current word, or the identity of the previous two tags. The specific word and tag context available to a feature is given in the following definition of a history h_i:</Paragraph> <Paragraph position="1"> h_i = {w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}}. For example, f_j(h_i, t_i) = 1 if suffix(w_i) = "ing" and t_i = VBG, and 0 otherwise.</Paragraph> <Paragraph position="2"> If the above feature exists in the feature set of the model, its corresponding model parameter will contribute towards the joint probability p(h_i, t_i) when w_i ends with "ing" and when t_i = VBG.¹</Paragraph> <Paragraph position="3"> Thus a model parameter α_j effectively serves as a "weight" for a certain contextual predictor, in this case the suffix "ing", towards the probability of observing a certain tag, in this case a VBG.</Paragraph> <Paragraph position="4"> The model generates the space of features by scanning each pair (h_i, t_i) in the training data with the feature "templates" given in Table 1. Given h_i as the current history, a feature always asks some yes/no question about h_i, and furthermore constrains t_i to be a certain tag. The instantiations for the variables X, Y, and T in Table 1 are obtained automatically from the training data.</Paragraph> <Paragraph position="5"> The generation of features for tagging unknown words relies on the hypothesized distinction that "rare" words² in the training set are similar to unknown words in test data, with respect to how their spellings help predict their tags. The rare word features in Table 1, which look at the word spellings, will apply to both rare words and unknown words in test data. (² A "rare" word is one which occurs less than 5 times in the training set. The count of 5 was chosen by subjective inspection of words in the training data.)</Paragraph>
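Table 1 itself is not reproduced in this version of the text, so the sketch below shows one plausible way to instantiate templates of the kind described (previous tags, surrounding words, and spelling features for rare words) by scanning each (h_i, t_i) pair. The exact template inventory and spelling features are assumptions for illustration; only the rare-word threshold of 5 and the count cutoff of 10 come from the text.

```python
from collections import Counter


def extract_features(words, tags, i, word_counts, rare_cutoff=5):
    """Instantiate features for the event (h_i, t_i): contextual templates for
    every word, and spelling templates for 'rare' words, which stand in for
    the unknown words that will appear at test time."""
    w, t = words[i], tags[i]
    prev = tags[i - 1] if i >= 1 else "*"
    prev2 = tags[i - 2] if i >= 2 else "*"
    feats = [
        (f"t-1={prev}", t),
        (f"t-2,t-1={prev2},{prev}", t),
        (f"w-1={words[i - 1] if i >= 1 else '*'}", t),
        (f"w+1={words[i + 1] if i + 1 < len(words) else '*'}", t),
    ]
    if word_counts[w] >= rare_cutoff:
        feats.append((f"w={w}", t))  # identity feature for non-rare words
    else:
        # Hypothetical spelling features for rare words.
        feats += [(f"suffix={w[-k:]}", t) for k in (1, 2, 3, 4) if len(w) > k]
        feats += [(f"prefix={w[:k]}", t) for k in (1, 2, 3, 4) if len(w) > k]
        feats.append((f"has-hyphen={'-' in w}", t))
        feats.append((f"has-digit={any(c.isdigit() for c in w)}", t))
        feats.append((f"has-upper={any(c.isupper() for c in w)}", t))
    return feats


def feature_space(sentences, cutoff=10):
    """Scan every training event and keep the features whose counts meet the
    reliability cutoff discussed below (the paper exempts current-word
    features from the cutoff; that exemption is omitted here for brevity)."""
    word_counts = Counter(w for words, _ in sentences for w in words)
    counts = Counter()
    for words, tags in sentences:
        for i in range(len(words)):
            counts.update(extract_features(words, tags, i, word_counts))
    return {f for f, c in counts.items() if c >= cutoff}
```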
<Paragraph position="6"> For example, Table 2 contains an excerpt from the training data, Table 3 contains the features generated while scanning (h_3, t_3), in which the current word is about, and Table 4 contains the features generated while scanning (h_4, t_4), in which the current word, well-heeled, occurs 3 times in the training data and is therefore classified as "rare". The behavior of a feature that occurs very sparsely in the training set is often difficult to predict, since its statistics may not be reliable. Therefore, the model uses the heuristic that any feature which occurs less than 10 times in the data is unreliable, and ignores features whose counts are less than 10.³ While there are many smoothing algorithms which use techniques more rigorous than a simple count cutoff, they have not yet been investigated in conjunction with this tagger. (³ Except for features that look only at the current word, i.e., features of the form w_i = <word> and t_i = <TAG>. The count of 10 was chosen by inspection of Training and Development data.)</Paragraph> <Paragraph position="14"> Testing the Model The test corpus is tagged one sentence at a time.</Paragraph> <Paragraph position="15"> The testing procedure requires a search to enumerate the candidate tag sequences for the sentence, and the tag sequence with the highest probability is chosen as the answer.</Paragraph> <Paragraph position="16"> Search Algorithm The search algorithm, essentially a "beam search", uses the conditional tag probability</Paragraph> <Paragraph position="17"> p(t | h) = p(h, t) / Σ_{t' ∈ T} p(h, t')</Paragraph> <Paragraph position="18"> and maintains, as it sees a new word, the N highest probability tag sequence candidates up to that point in the sentence. Given a sentence {w_1, ..., w_n}, a tag sequence candidate {t_1, ..., t_n} has conditional probability:</Paragraph> <Paragraph position="19"> P(t_1, ..., t_n | w_1, ..., w_n) = ∏_{i=1}^{n} p(t_i | h_i)</Paragraph> <Paragraph position="20"> In addition, the search procedure optionally consults a Tag Dictionary, which, for each known word, lists the tags that it has appeared with in the training set. If the Tag Dictionary is in effect, the search procedure, for known words, generates only the tags given by the dictionary entry, while for unknown words, it generates all tags in the tag set.</Paragraph> <Paragraph position="21"> Without the Tag Dictionary, the search procedure generates all tags in the tag set for every word.</Paragraph> <Paragraph position="22"> Let W = {w_1, ..., w_n} be a test sentence, and let s_ij be the jth highest probability tag sequence up to and including word w_i. The search is described below: 1. Generate tags for w_1, find the top N, and set s_1j, 1 ≤ j ≤ N, accordingly.</Paragraph> <Paragraph position="23"> 2. Initialize i = 2. (a) Initialize j = 1. (b) Generate tags for w_i, given s_(i-1)j as the previous tag context, and append each tag to s_(i-1)j to make a new sequence. (c) j = j + 1; repeat from (b) if j ≤ N.</Paragraph> <Paragraph position="24"> 3. Find the N highest probability sequences generated by the above loop, and set s_ij, 1 ≤ j ≤ N, accordingly. 4. i = i + 1; repeat from (a) if i ≤ n. 5. Return the highest probability sequence, s_n1.</Paragraph>
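The search in steps 1-5 can be sketched as follows, assuming a function cond_prob(history) that returns the distribution p(t | h) (for instance, the conditional sketch shown earlier with features and weights bound in) and an optional tag dictionary mapping known words to their observed tags; all names are illustrative, not the paper's implementation.

```python
import math


def beam_search(words, tagset, cond_prob, tag_dict=None, beam=5):
    """Keep the N (= beam) highest probability tag sequences s_ij while moving
    left to right through the sentence, as in steps 1-5 above."""
    hyps = [(0.0, [])]  # (log probability, tag sequence so far)
    for i, w in enumerate(words):
        candidates = []
        for logp, seq in hyps:
            history = {"word": w, "words": words, "index": i,
                       "tag-1": seq[-1] if seq else "*",
                       "tag-2": seq[-2] if len(seq) > 1 else "*"}
            dist = cond_prob(history)
            # The Tag Dictionary restricts known words to their observed tags.
            allowed = tag_dict.get(w, tagset) if tag_dict else tagset
            for t in allowed:
                candidates.append((logp + math.log(max(dist[t], 1e-12)), seq + [t]))
        # Keep the N best sequences up to and including word w_i.
        hyps = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
    return hyps[0][1]  # highest probability sequence, s_n1
```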
</Section> <Section position="2" start_page="136" end_page="138" type="sub_section"> <SectionTitle> Experiments </SectionTitle> <Paragraph position="0"> In order to conduct tagging experiments, the Wall St. Journal data has been split into three contiguous sections, as shown in Table 5. The feature set and search algorithm were tested and debugged only on the Training and Development sets, and the official test result on the unseen Test set is presented in the conclusion of the paper.</Paragraph> <Paragraph position="1"> The performances of the "baseline" model on the Development Set, both with and without the Tag Dictionary, are shown in Table 6.</Paragraph> <Paragraph position="2"> All experiments use a beam size of N = 5; further increasing the beam size does not significantly increase performance on the Development Set but adversely affects the speed of the tagger.</Paragraph> <Paragraph position="3"> Even though use of the Tag Dictionary gave an apparently insignificant (.12%) improvement in accuracy, it is used in further experiments since it significantly reduces the number of hypotheses and thus speeds up the tagger.</Paragraph> <Paragraph position="4"> The running time of the parameter estimation algorithm is O(NTA), where N is the training set size, T is the number of allowable tags, and A is the average number of features that are active for a given event (h, t). The running time of the search procedure on a sentence of length N is O(NTAB), where T and A are defined above, and B is the beam size. In practice, the model for the experiment shown in Table 6 requires approximately 24 hours to train, and 1 hour to test⁴ on an IBM RS/6000 Model 380 with 256MB of RAM.</Paragraph> <Paragraph position="5"> The Maximum Entropy model allows arbitrary binary-valued features on the context, so it can use additional specialized, i.e., word-specific, features to correctly tag the "residue" that the baseline features cannot model. Since such features typically occur infrequently, the training set consistency must be good enough to yield reliable statistics. Otherwise the specialized features will model noise and perform poorly on test data.</Paragraph> <Paragraph position="6"> Such features can be designed for those words which are especially problematic for the model.</Paragraph> <Paragraph position="7"> The top errors of the model (over the training set) are shown in Table 7; clearly, the model has trouble with the words that and about, among others. As hypothesized in the introduction, better features on the context surrounding that and about should correct the tagging mistakes for these two words, assuming that the tagging errors are due to an impoverished feature set, and not to inconsistent data.</Paragraph> <Paragraph position="8"> Specialized features for a given word are constructed by conjoining certain features in the baseline model with a question about the word itself. The features which ask about previous tags and surrounding words now additionally ask about the identity of the current word, e.g., a specialized feature for the word about in Table 3 could be:</Paragraph> <Paragraph position="9"> f_j(h_i, t_i) = 1 if w_i = "about" and t_{i-2} t_{i-1} = DT NNS and t_i = IN, and 0 otherwise.</Paragraph>
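A minimal sketch of the conjunction step just described: each baseline contextual feature for a "difficult" word is paired with the word's identity, yielding word-specific predictors such as (w=about & t-2,t-1=DT,NNS) → IN. The feature encoding follows the extractor sketched earlier; the set of difficult words and the choice of which templates to conjoin are assumptions for illustration.

```python
def specialize(baseline_feats, word, difficult_words):
    """Conjoin the contextual features of a 'difficult' word with the word's
    identity, producing specialized features in addition to the baseline ones."""
    if word not in difficult_words:
        return baseline_feats
    specialized = []
    for condition, tag in baseline_feats:
        # Features that already ask about the current word's identity or
        # spelling gain nothing from the conjunction, so skip them.
        if condition.startswith(("w=", "suffix=", "prefix=")):
            continue
        specialized.append((f"w={word} & {condition}", tag))
    return baseline_feats + specialized


# Hypothetical usage with the extractor sketched earlier:
# feats = extract_features(words, tags, i, word_counts)
# feats = specialize(feats, words[i], difficult_words={"that", "about", "more"})
```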
<Paragraph position="10"> Table 8 shows the results of an experiment in which specialized features are constructed for "difficult" words and are added to the baseline feature set. Here, "difficult" words are those that are mistagged a certain way at least 50 times when the training set is tagged with the baseline model.</Paragraph> <Paragraph position="11"> Using the set of 29 difficult words, the model performs at 96.49% accuracy on the Development Set, an insignificant improvement over the baseline accuracy of 96.43%. Table 9 shows the change in error rates on the Development Set for the frequently occurring "difficult" words. For most words, the specialized model yields little or no improvement, and for some, e.g., more and about, the specialized model performs worse.</Paragraph> <Paragraph position="12"> The lack of improvement implies that either the feature set is still impoverished, or that the training data is inconsistent. A simple consistency test is to graph the POS tag assignments for a given word as a function of the article in which it occurs. Consistently tagged words should have roughly the same tag distribution as the article numbers vary. Figure 1 represents each POS tag with a unique integer and graphs the POS annotation of about in the training set as a function of the articles (the points are "scattered" to show density). As seen in Figure 1, about is usually annotated with tag #1, which denotes IN (preposition), or tag #9, which denotes RB (adverb), and the observed probability of either choice depends heavily on the current article. Upon further examination⁵, the tagging distribution for about changes precisely when the annotator changes. Figure 2, which again uses integers to denote POS tags, shows the tag distribution of about as a function of the annotator, and implies that the tagging errors for this word are due mostly to inconsistent data. The words ago, chief, down, executive, off, out, up, and yen also exhibit a similar bias.</Paragraph> <Paragraph position="13"> Thus specialized features may be less effective for those words affected by inter-annotator bias.</Paragraph> <Paragraph position="14"> A simple solution to eliminate inter-annotator inconsistency is to train and test the model on data that has been created by the same annotator. The results of such an experiment⁶ are shown in Table 10. The total accuracy is higher, implying that the singly-annotated training and test sets are more consistent, and the improvement due to the specialized features is higher than before (.1%) but still modest, implying that either the features need further improvement or that intra-annotator inconsistencies exist in the corpus. (⁶ The single-annotator training data was created by extracting those articles tagged by "maryann" in the Treebank v.5 CDROM. This training data does not overlap with the Development and Test sets used in the paper. The single-annotator Development Set is the portion of the Development Set which has also been annotated by "maryann". The word vocabulary and tag dictionary are the same as in the baseline experiment.)</Paragraph>
</Section> <Section position="3" start_page="138" end_page="140" type="sub_section"> <SectionTitle> Comparison With Previous Work </SectionTitle> <Paragraph position="0"> Most of the recent corpus-based POS taggers in the literature are either statistically based, and use Markov Model (Weischedel et al., 1993; Merialdo, 1994) or Statistical Decision Tree (Jelinek et al., 1994; Magerman, 1995) (SDT) techniques, or are primarily rule based, such as Brill's Transformation Based Learner (Brill, 1994) (TBL). The Maximum Entropy (MaxEnt) tagger presented in this paper combines the advantages of all these methods. It uses a rich feature representation, like TBL and SDT, and generates a tag probability distribution for each word, like Decision Tree and Markov Model techniques.</Paragraph> <Paragraph position="2"> (Weischedel et al., 1993) provide the results from a battery of "tri-tag" Markov Model experiments, in which the probability P(W, T) of observing a word sequence W = {w_1, w_2, ..., w_n} together with a tag sequence T = {t_1, t_2, ..., t_n} is given by:</Paragraph> <Paragraph position="3"> P(W, T) = ∏_{i=1}^{n} p(t_i | t_{i-1} t_{i-2}) p(w_i | t_i)</Paragraph> <Paragraph position="4"> Furthermore, p(w_i | t_i) for unknown words is computed by the following heuristic, which uses a set of 35 pre-determined endings:</Paragraph> <Paragraph position="5"> p(w_i | t_i) = p(unknown word | t_i) × p(capital feature | t_i) × p(endings, hyphenation | t_i)</Paragraph> <Paragraph position="6"> This approximation works as well as the MaxEnt model, giving 85% unknown word accuracy (Weischedel et al., 1993) on the Wall St. Journal, but cannot be generalized to handle more diverse information sources. Multiplying together all the probabilities becomes a less convincing approximation as the information sources become less independent. In contrast, the MaxEnt model combines diverse and non-local information sources without making any independence assumptions.</Paragraph>
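For contrast with the MaxEnt model, the sketch below scores a tagged sentence under the tri-tag factorization just described; the probability tables and the unknown-word heuristic are passed in as functions, since their estimation belongs to (Weischedel et al., 1993) and is not reproduced here.

```python
import math


def tritag_log_prob(words, tags, p_tag, p_word, p_unknown, vocab):
    """log P(W, T) = sum over i of log p(t_i | t_{i-1} t_{i-2}) + log p(w_i | t_i).
    For words outside `vocab`, p(w_i | t_i) is replaced by the heuristic product
    of p(unknown | t_i), p(capital feature | t_i) and p(ending | t_i), which the
    caller supplies as `p_unknown`."""
    logp = 0.0
    for i, (w, t) in enumerate(zip(words, tags)):
        t1 = tags[i - 1] if i >= 1 else "*"
        t2 = tags[i - 2] if i >= 2 else "*"
        logp += math.log(p_tag(t, t1, t2))
        logp += math.log(p_word(w, t) if w in vocab else p_unknown(w, t))
    return logp
```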
<Paragraph position="7"> A POS tagger is one component in the SDT-based statistical parsing system described in (Jelinek et al., 1994; Magerman, 1995). The total word accuracy on Wall St. Journal data, 96.5% (Magerman, 1995), is similar to that presented in this paper. However, the aforementioned SDT techniques require word classes (Brown et al., 1992) to help prevent data fragmentation, and a sophisticated smoothing algorithm to mitigate the effects of any fragmentation that occurs. Unlike SDT, the MaxEnt training procedure does not recursively split the data, and hence does not suffer from unreliable counts due to data fragmentation. As a result, no word classes are required and a trivial count cutoff suffices as a smoothing procedure in order to achieve roughly the same level of accuracy.</Paragraph> <Paragraph position="8"> TBL is a non-statistical approach to POS tagging which also uses a rich feature representation, and performs at a total word accuracy of 96.5% and an unknown word accuracy of 85% (Brill, 1994). The TBL representation of the surrounding word context is almost the same⁷, and the TBL representation of unknown words is a superset⁸ of the unknown word representation in this paper. However, since TBL is non-statistical, it does not provide probability distributions and, unlike MaxEnt, cannot be used as a probabilistic component in a larger model. MaxEnt can provide a probability for each tagging decision, which can be used in the probability calculation of any structure that is predicted over the POS tags, such as noun phrases, or entire parse trees, as in (Jelinek et al., 1994; Magerman, 1995).</Paragraph> <Paragraph position="9"> ⁷ (Brill, 1994) looks at words ±3 away from the current word, whereas the feature set in this paper uses a window of ±2. ⁸ (Brill, 1994) uses prefix/suffix additions and deletions, which are not used in this paper.</Paragraph> <Paragraph position="11"> Thus MaxEnt has at least one advantage over each of the reviewed POS tagging techniques. It is better able to use diverse information than Markov Models, requires fewer supporting techniques than SDT, and, unlike TBL, can be used in a probabilistic framework. However, the POS tagging accuracy on the Penn Wall St. Journal corpus is roughly the same for all these modelling techniques. The convergence of the accuracy rates implies that either all these techniques are missing the right predictors in their representation to capture the "residue", or, more likely, that any corpus-based algorithm on the Penn Treebank Wall St. Journal corpus will not perform much higher than 96.5% due to consistency problems.</Paragraph> <Paragraph position="12"> Conclusion The Maximum Entropy model is an extremely flexible technique for linguistic modelling, since it can use a virtually unrestricted and rich feature set in the framework of a probability model. The implementation in this paper is a state-of-the-art POS tagger, as evidenced by the 96.6% accuracy on the unseen Test set, shown in Table 11.</Paragraph> <Paragraph position="13"> The model with specialized features does not perform much better than the baseline model, and further discovery or refinement of word-based features is difficult given the inconsistencies in the training data. A model trained and tested on data from a single annotator performs at .5% higher accuracy than the baseline model and should produce more consistent input for applications that require tagged text.</Paragraph> </Section> </Section> </Paper>