File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/e99-1025_metho.xml
Size: 24,401 bytes
Last Modified: 2025-10-06 14:15:19
<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1025"> <Title>New Models for Improving Supertag Disambiguation</Title> <Section position="4" start_page="188" end_page="188" type="metho"> <SectionTitle> 2 Supertagging </SectionTitle> <Paragraph position="0"> Supertags, the primary elements of the LTAG formalism, attempt to localize dependencies, including long distance dependencies. This is accomplished by grouping syntactically or semantically dependent elements within the same structure. Thus, as seen in Figure 1, supertags contain more information than standard part-of-speech tags, and there are many more supertags per word than part-of-speech tags. In fact, supertag disambiguation may be characterized as providing an almost parse, as shown in the bottom part of Figure 1.</Paragraph> <Paragraph position="1"> Local statistical information, in the form of a trigram model based on the distribution of supertags in an LTAG parsed corpus, can be used to choose the most appropriate supertag for any given word. Joshi and Srinivas (1994) define supertagging as the process of assigning the best supertag to each word. Srinivas (1997b) and Srinivas (1997a) have tested the performance of a trigram model, of the kind typically used for part-of-speech tagging, on supertagging in restricted domains such as ATIS and in less restricted domains such as the Wall Street Journal (WSJ).</Paragraph> <Paragraph position="2"> In this work, we explore a variety of local techniques in order to improve the performance of supertagging. All of the models presented here perform smoothing using a Good-Turing discounting technique with Katz's backoff model.</Paragraph> <Paragraph position="3"> With exceptions where noted, our models were trained on one million words of Wall Street Journal data and tested on 48K words. The data and evaluation procedure are similar to those used in Srinivas (1997b). 
The data was derived by mapping structural information from the Penn Treebank WSJ corpus into supertags from the XTAG grammar (The XTAG-Group (1995)) using heuristics (Srinivas (1997a)). Using this data, the trigram model for supertagging achieves an accuracy of 91.37%, meaning that 91.37% of the words in the test corpus were assigned the correct supertag. [Footnote 1: ... in Srinivas (1997b) was based on a different supertag tagset; specifically, the supertag corpus was reannotated with detailed supertags for punctuation and with a different analysis for subordinating conjunctions.]</Paragraph> </Section> <Section position="5" start_page="188" end_page="192" type="metho"> <SectionTitle> 3 Contextual Models </SectionTitle> <Paragraph position="0"> As noted in Srinivas (1997b), a trigram model often fails to capture the cooccurrence dependencies between a head and its dependents, which might not appear within a trigram's window size. For example, in the sentence &quot;Many Indians feared their country might split again&quot; the presence of might influences the choice of the supertag for feared, an influence that is not accounted for by the trigram model. As described below, we show that the introduction of features which take into account aspects of head-dependency relationships improves the accuracy of supertagging.</Paragraph> <Section position="1" start_page="188" end_page="189" type="sub_section"> <SectionTitle> 3.1 One Pass Head Trigram Model </SectionTitle> <Paragraph position="0"> In a head model, the prediction of the current supertag is conditioned not on the immediately preceding two supertags, but on the supertags for the two previous head words. This model may thus be considered to be using a context of variable length [footnote 2]. The sentence &quot;Many Indians feared their country might split again&quot; shows a head model's strengths over the trigram model. 
There are at least two frequently assigned supertags for the word feared: a more frequent one corresponding to a subcategorization of NP object (as ~n of Figure 1) and a less frequent one corresponding to an S complement. The supertag for the word might, which is highly likely to be modeled as an auxiliary verb in this case, provides strong evidence for the latter.</Paragraph> <Paragraph position="1"> Notice that might and feared appear within a head model's two head window, but not within the trigram model's two word window. We may therefore expect that a head model would make a more accurate prediction.</Paragraph> <Paragraph position="2"> Srinivas (1997b) presents a two pass head trigram model. In the first pass, it tags words as either head words or non-head words. Training data for this pass is obtained using a head percolation table (Magerman (1995)) on bracketed Penn Treebank sentences. After training, head tagging is performed according to Equation 1, where p̂ is the estimated probability and H(i) is a characteristic function which is true iff word i is a head word.</Paragraph> <Paragraph position="4"> The second pass then takes the words with this head information and supertags them according to Equation 2, where t_{H(i,j)} is the supertag of the jth head from word i.</Paragraph> <Paragraph position="5"> [Footnote 2: Part of speech tagging models have not used heads in this manner to achieve variable length contexts. Variable length n-gram models, one of which is described in Niesler and Woodland (1996), have been used instead.]</Paragraph> <Paragraph position="8"> This model achieves an accuracy of 87%, lower than the trigram model's accuracy.</Paragraph> <Paragraph position="9"> Our current approach differs significantly. Instead of defining heads through the use of the head percolation table on the Penn Treebank, we define headedness in terms of the supertags themselves. 
The set of supertags can naturally be partitioned into head and non-head supertags. Head supertags correspond to those that represent a predicate and its arguments, such as α3 and α7. Conversely, non-head supertags correspond to those supertags that represent modifiers or adjuncts, such as β1 and β2.</Paragraph> <Paragraph position="10"> Now, the tree that is assigned to a word during supertagging determines whether or not it is to be a head word. Thus, a simple adaptation of the Viterbi algorithm suffices to compute Equation 2 in a single pass, yielding a one pass head trigram model. Using the same training and test data, the one pass head model achieved 90.75% accuracy, constituting a 28.8% relative reduction in error over the two pass head trigram model (the error rate falls from 13% to 9.25%). This improvement may come from a reduction in error propagation or the richer context that is being used to predict heads.</Paragraph> </Section> <Section position="2" start_page="189" end_page="190" type="sub_section"> <SectionTitle> 3.2 Mixed Head and Trigram Models </SectionTitle> <Paragraph position="0"> The head model skips words that it does not consider to be head words and hence may lose valuable information. The lack of immediate local context hurts the head model in many cases, such as selection between head noun and noun modifier, and is a reason for its lower performance relative to the trigram model. Consider the phrase &quot;..., or $ 2.48 a share.&quot; The word 2.48 may either be associated with a determiner phrase supertag (β1) or a noun phrase supertag (α9). Notice that 2.48 is immediately preceded by $, which is extremely likely to be supertagged as a determiner phrase (β1). This is strong evidence that 2.48 should be supertagged as α9. A pure head model cannot consider this particular fact, however, because β1 is not a head supertag. 
Thus, local context and long distance head dependency relationships are both important for accurate supertagging.</Paragraph> <Paragraph position="1"> A 5-gram mixed model that includes both the trigram and the head trigram context is one approach to this problem. This model achieves an accuracy of 91.50%, an improvement over both the trigram model and the head trigram model.</Paragraph> <Paragraph position="3"> [Table 1 caption fragment: ... conditioning context and the current supertag deterministically establish the next conditioning context. H, LM, and RM denote the entities head, left modifier, and right modifier, respectively.]</Paragraph> <Paragraph position="5"> We hypothesize that the improvement is limited because of a large increase in the number of parameters to be estimated.</Paragraph> <Paragraph position="6"> As an alternative, we explore a 3-gram mixed model that incorporates nearly all of the relevant information. This mixed model may be described as follows. Recall that we partition the set of all supertags into heads and modifiers. Modifiers have been defined so as to share the characteristic that each one modifies either exactly one item to the right or exactly one item to the left. Consequently, we further divide modifiers into left modifiers (β4) and right modifiers. Now, instead of fixing the conditioning context to be either the two previous tags (as in the trigram model) or the two previous head tags (as in the head trigram model), we allow it to vary according to the identity of the current tag and the previous conditioning context, as shown in Table 1. Intuitively, the mixed model is like the trigram model except that a modifier tag is discarded from the conditioning context when it has found an object of modification. The mixed model achieves an accuracy of 91.79%, a significant improvement over both the head trigram model's and the trigram model's accuracies, p &lt; 0.05. 
Furthermore, this mixed model is computationally more efficient as well as more accurate than the 5-gram model.</Paragraph> </Section> <Section position="3" start_page="190" end_page="190" type="sub_section"> <SectionTitle> 3.3 Head Word Models </SectionTitle> <Paragraph position="0"> Head words often seem to be more predictive of dependency relations than head supertags. Based upon this observation, we have implemented models where head words have been used as features. The head word model predicts the current supertag based on the two previous head words (backing off to their supertags) as shown in Equation 3.</Paragraph> <Paragraph position="2"> The mixed trigram and head word model takes into account local (supertag) context and long distance (head word) context. Both of these models appear to suffer from severe sparse data problems.</Paragraph> <Paragraph position="3"> It is not surprising, then, that the head word model achieves an accuracy of only 88.16%, and the mixed trigram and head word model achieves an accuracy of 89.46%. We were only able to train the latter model with 250K words of training data because of memory problems that were caused by computing the large parameter space of that model.</Paragraph> <Paragraph position="4"> The salient characteristics of models that have been discussed in this subsection are summarized in Table 2.</Paragraph> </Section> <Section position="4" start_page="190" end_page="191" type="sub_section"> <SectionTitle> 3.4 Classifier Combination </SectionTitle> <Paragraph position="0"> While the features that our new models have considered are useful, an n-gram model that considers all of them would run into severe sparse data problems. This difficulty may be surmounted through the use of more elaborate backoff techniques. Alternatively, we could consider using decision trees at choice points in order to decide which features are most relevant at each point. 
For now, however, we have experimented with classifier combination as a means of ameliorating the sparse data problem while making use of the feature combinations that we have introduced.</Paragraph> <Paragraph position="1"> In this approach, each model in a selection of the discussed models is treated as a separate classifier and is trained on the same data. Subsequently, each classifier supertags the test corpus separately. Finally, their predictions are combined using various voting strategies.</Paragraph> <Paragraph position="2"> The same 1000K word training corpus is used in models of classifier combination as is used in previous models. We created three distinct partitions of this 1000K word corpus, each partition consisting of a 900K word training corpus and a 100K word tune corpus. In this manner, we ended up with a total of 300K words of tuning data.</Paragraph> <Paragraph position="3"> We consider three voting strategies suggested by van Halteren et al. (1998): equal vote, where each classifier's vote is weighted equally; overall accuracy, where the weight depends on the overall accuracy of a classifier; and pairwise voting. Pairwise voting works as follows. First, for each pair of classifiers a and b, the empirical probability p̂(t_correct | t_classifier-a, t_classifier-b) is computed from tuning data, where t_classifier-a and t_classifier-b are classifier a's and classifier b's supertag assignments for a particular word, respectively, and t_correct is the correct supertag. Subsequently, on the test data, each classifier pair votes, weighted by overall accuracy, for the supertag with the highest empirical probability as determined in the previous step, given each individual classifier's guess.</Paragraph> <Paragraph position="4"> The results from these voting strategies are positive. 
Equal vote yields an accuracy of 91.89%.</Paragraph> <Paragraph position="5"> Overall accuracy vote has an accuracy of 91.93%.</Paragraph> <Paragraph position="6"> Pairwise voting yields an accuracy of 92.19%, the highest supertagging accuracy that has been achieved, a 9.5% relative reduction in error over the trigram model.</Paragraph> <Paragraph position="7"> The accuracies of pairwise combinations of classifiers are shown in Table 3 [footnote 3]. The efficacy of pairwise combination (which has significantly fewer parameters to estimate) in ameliorating the sparse data problem can be seen clearly. For example, the accuracy of the pairwise combination of the head classifier and the trigram classifier exceeds that of the 5-gram mixed model. It is also marginally, but not significantly, higher than that of the 3-gram mixed model. It is also notable that the pairwise combination of the head word classifier and the mix word classifier yields a significant improvement over either classifier, p &lt; 0.05, considering the disparity between the accuracies of its component classifiers.</Paragraph> <Paragraph position="8"> [Footnote 3: Entries marked with an asterisk (&quot;*&quot;) correspond to cases where the pairwise combination of classifiers was significantly better than either of their component classifiers, p &lt; 0.05.]</Paragraph> </Section> <Section position="5" start_page="191" end_page="192" type="sub_section"> <SectionTitle> 3.5 Further Evaluation </SectionTitle> <Paragraph position="0"> We also compare various models' performance on base-NP detection and PP attachment disambiguation. The results will underscore the adroitness of the classifier combination model in using both local and long distance features. They will also show that, depending on the ultimate application, one model may be more appropriate than another model.</Paragraph> <Paragraph position="1"> A base-NP is a non-recursive NP structure whose detection is useful in many applications, such as information extraction. 
We extend our supertagging models to perform this task in a fashion similar to that described in Srinivas (1997b). Selected models have been trained on 200K words.</Paragraph> <Paragraph position="2"> Subsequently, after a model has supertagged the test corpus, a procedure detects base-NPs by scanning for appropriate sequences of supertags. Results for base-NP detection are shown in Table 4.</Paragraph> <Paragraph position="3"> Note that the mixed model performs nearly as well as the trigram model. Note also that the head trigram model is outperformed by the other models. We suspect that, unlike the trigram model, the head model does not accurately model the local context that is important for base-NP detection.</Paragraph> <Paragraph position="4"> In contrast, information about long distance dependencies is more important for the PP attachment task. In this task, a model must decide whether a PP attaches at the NP or the VP level. This corresponds to a choice between two PP supertags: one associated with NP attachment, and another associated with VP attachment. The trigram model, head trigram model, 3-gram mixed model, and classifier combination model perform at accuracies of 78.53%, 79.56%, 80.16%, and 82.10%, respectively, on the PP attachment task.</Paragraph> </Section> </Section> <Section position="6" start_page="192" end_page="192" type="metho"> <SectionTitle> NP chunking </SectionTitle> <Paragraph position="0"> As may be expected, the trigram model performs the worst on this task, presumably because it is restricted to considering purely local information.</Paragraph> </Section> <Section position="7" start_page="192" end_page="193" type="metho"> <SectionTitle> 4 Class Based Models </SectionTitle> <Paragraph position="0"> Contextual models tag each word with the single most appropriate supertag. In many applications, however, it is sufficient to reduce ambiguity to a small number of supertags per word. 
For example, using traditional TAG parsing methods, such as those described in Schabes (1990), it is inefficient to parse with a large LTAG grammar for English such as XTAG (The XTAG-Group (1995)).</Paragraph> <Paragraph position="1"> In these circumstances, a single word may be associated with hundreds of supertags. Reducing ambiguity to some small number k, say k &lt; 5 supertags per word [footnote 4], would accelerate parsing considerably. [Footnote 5: ... effectively shares the computation associated with each lexicalized elementary tree (supertag) is described in Evans and Weir (1998). It would be worth comparing both approaches.] As an alternative, once such a reduction in ambiguity has been achieved, partial parsing or other techniques could be employed to identify the best single supertag. These are the aims of class based models, which assign a small set of supertags to each word. This approach is related to work by Brown et al. (1992), where mutual information is used to cluster words into classes for language modeling. In our work with class based models, we have considered only trigram based approaches so far.</Paragraph> <Section position="1" start_page="192" end_page="193" type="sub_section"> <SectionTitle> 4.1 Context Class Model </SectionTitle> <Paragraph position="0"> One reason why the trigram model of supertagging is limited in its accuracy is that it considers only a small contextual window around the word to be supertagged when making its tagging decision.</Paragraph> <Paragraph position="1"> Instead of using this limited context to pinpoint the exact supertag, we postulate that it may be used to predict certain structural characteristics of the correct supertag with much higher accuracy. In the context class model, supertags that share the same characteristics are grouped into classes and these classes, rather than individual supertags, are predicted by a trigram model. 
This is reminiscent of Samuelsson and Reich (1999), where some part of speech tags have been compounded so that each word is deterministically in one class.</Paragraph> <Paragraph position="2"> The grouping procedure may be described as follows. Recall that each supertag corresponds to a lexicalized tree t ∈ G, where G is a particular LTAG. Using standard FIRST and FOLLOW techniques, we may associate t with FOLLOW and PRECEDE sets, FOLLOW(t) being the set of supertags that can immediately follow t and PRECEDE(t) being those supertags that can immediately precede t. For example, an NP tree such as 81 would be in the FOLLOW set of a supertag of a verb that subcategorizes for an NP complement. We partition the set of all supertags into classes such that all of the supertags in a particular class are associated with lexicalized trees with the same PRECEDE and FOLLOW sets. For instance, the supertags t1 and t2 corresponding respectively to the NP and S subcategorizations of the verb feared would be associated with the same class. (Note that a head NP tree would be a member of both FOLLOW(t1) and FOLLOW(t2).) The context class model predicts sets of supertags for words as follows. First, the trigram model supertags each word w_i with supertag t_i that belongs to class C_i [footnote 6]. Furthermore, using the training corpus, we obtain the set D_i which contains all supertags t such that p̂(w_i | t) > 0. The word w_i is relabeled with the set of supertags C_i ∩ D_i.</Paragraph> <Paragraph position="3"> The context class model trades off an increased ambiguity of 1.65 supertags per word on average for a higher accuracy of 92.51%. For the purpose of comparison, we may compare this model against a baseline model that partitions the set of all supertags into classes so that all of the supertags in one class share the same preterminal symbol, i.e., they are anchored by words which share the same part of speech. 
With classes defined in this manner, call C_i the set of supertags that belong to the class which is associated with word w_i in the test corpus. We may then associate with word w_i the set of supertags C_i ∩ D_i, where D_i is defined as above. This baseline procedure yields an average ambiguity of 5.64 supertags per word with an accuracy of 97.96%. (Proceedings of EACL '99)</Paragraph> </Section> <Section position="2" start_page="193" end_page="193" type="sub_section"> <SectionTitle> 4.2 Confusion Class Model </SectionTitle> <Paragraph position="0"> The confusion class model partitions supertags into classes according to an alternative procedure.</Paragraph> <Paragraph position="1"> Here, classes are derived from a confusion matrix analysis of errors which the trigram model makes while supertagging. First, the trigram model supertags a tune set. A confusion matrix is constructed, recording the number of times supertag t_i was confused for supertag t_j, or vice versa, in the tune set. Based on the top k pairs of supertags that are most confused, we construct classes of supertags that are confused with one another. For example, let t1 and t2 be two PP supertags which modify an NP and a VP respectively. The most common kind of mistake that the trigram model made on the tune data was to mistag t1 as t2, and vice versa. Hence, t1 and t2 are clustered by our method into the same confusion class. The second most common mistake was to confuse supertags that represent verb modifier PPs and those that represent verb argument PPs, while the third most common mistake was to confuse supertags that represent head nouns and noun modifiers. These, too, would form their own classes.</Paragraph> <Paragraph position="2"> The confusion class model predicts sets of supertags for words in a manner similar to the context class model. 
Unlike the context class model, however, in this model we have to choose k, the number of pairs of supertags which are extracted from the confusion matrix and over which confusion classes are formed. In our experiments, we have found that with k = 10, k = 20, and k = 40, the resulting models attain 94.61% accuracy and 1.86 tags per word, 95.76% accuracy and 2.23 tags per word, and 97.03% accuracy and 3.38 tags per word, respectively. Results of these, as well as other models discussed below, are plotted in Figure 2. The n-best model is a modification of the trigram model in which the n most probable supertags per word are chosen. The classifier union result is obtained by assigning a word w_i a set of supertags t_i1, ..., t_ik, where t_ij is the jth classifier's supertag assignment for word w_i, the classifiers being the models discussed in Section 3. It achieves an accuracy of</Paragraph> </Section> </Section> <Section position="8" start_page="193" end_page="193" type="metho"> <SectionTitle> 5 Future Work </SectionTitle> <Paragraph position="0"> We are considering extending our work in several directions. Srinivas (1997b) discussed a lightweight dependency analyzer which assigns dependencies assuming that each word has been assigned a unique supertag. We are extending this algorithm to work with class based models, which narrow down the number of supertags per word with much higher accuracy. Aside from the n-gram modeling that was a focus of this paper, we would also like to explore using other kinds of models, such as maximum entropy.</Paragraph> </Section> </Paper>
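<!--
The error reductions quoted in Sections 3.1 and 3.4 appear to be relative reductions in error rate. As a rough check, the following minimal Python sketch reproduces that arithmetic, assuming each accuracy is the percentage of correctly supertagged words. The function name relative_error_reduction is hypothetical and is not taken from the paper.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Return the percentage of the baseline's errors removed by the improved model.

    Both arguments are accuracies expressed as percentages below 100.
    This is an illustrative helper, not code from the paper.
    """
    baseline_error = 100.0 - baseline_accuracy
    improved_error = 100.0 - improved_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline accuracy must be below 100%")
    return 100.0 * (baseline_error - improved_error) / baseline_error


if __name__ == "__main__":
    # Two pass head trigram model (87%) vs. one pass head model (90.75%).
    print(f"{relative_error_reduction(87.0, 90.75):.1f}")   # prints 28.8
    # Trigram model (91.37%) vs. pairwise voting (92.19%).
    print(f"{relative_error_reduction(91.37, 92.19):.1f}")  # prints 9.5
```

Both calls reuse accuracies quoted in the text, and the results agree with the 28.8% and 9.5% figures reported there.
-->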