Proceedings of EACL '99 
New Models for Improving Supertag Disambiguation 
John Chen* 
Department of Computer 
and Information Sciences 
University of Delaware 
Newark, DE 19716 
jchen@cis.udel.edu 
Srinivas Bangalore 
AT&T Labs Research 
180 Park Avenue 
P.O. Box 971 
Florham Park, NJ 07932 
srini@research.att.com 
K. Vijay-Shanker 
Department of Computer 
and Information Sciences 
University of Delaware 
Newark, DE 19716 
vijay~cis.udel.edu 
Abstract 
In previous work, supertag disambigua- 
tion has been presented as a robust, par- 
tial parsing technique. In this paper 
we present two approaches: contextual 
models, which exploit a variety of fea- 
tures in order to improve supertag per- 
formance, and class-based models, which 
assign sets of supertags to words in order 
to substantially improve accuracy with 
only a slight increase in ambiguity. 
1 Introduction 
Many natural language applications are beginning 
to exploit some underlying structure of the lan- 
guage. Roukos (1996) and Jurafsky et al. (1995) 
use structure-based language models in the 
context of speech applications. Grishman (1995) 
and Hobbs et al. (1995) use phrasal information 
in information extraction. Alshawi (1996) uses 
dependency information in a machine translation 
system. The need to impose structure leads to 
the need to have robust parsers. There have 
been two main robust parsing paradigms: Fi- 
nite State Grammar-based approaches (such 
as Abney (1990), Grishman (1995), and 
Hobbs et al. (1997)) and Statistical Parsing 
(such as Charniak (1996), Magerman (1995), and 
Collins (1996)). 
Srinivas (1997a) has presented a different ap- 
proach called supertagging that integrates linguis- 
tically motivated lexical descriptions with the ro- 
bustness of statistical techniques. The idea un- 
derlying the approach is that the computation 
of linguistic structure can be localized if lexical 
items are associated with rich descriptions (Su- 
pertags) that impose complex constraints in a lo- 
cal context. Supertag disambiguation is resolved 
"Supported by NSF grants ~SBR-9710411 and 
~GER-9354869 
by using statistical distributions of supertag co- 
occurrences collected from a corpus of parses. It 
results in a representation that is effectively a 
parse (almost parse). 
Supertagging has been found useful for a num- 
ber of applications. For instance, it can be 
used to speed up conventional chart parsers be- 
cause it reduces the ambiguity which a parser 
must face, as described in Srinivas (1997a). 
Chandrasekhar and Srinivas (1997) has shown 
that supertagging may be employed in informa- 
tion retrieval. Furthermore, given a sentence 
aligned parallel corpus of two languages and al- 
most parse information for the sentences of one 
of the languages, one can rapidly develop a gram- 
mar for the other language using supertagging, as 
suggested by Bangalore (1998). 
In contrast to the aforementioned work in su- 
pertag disambiguation, where the objective was 
to provide a-direct comparison between trigram 
models for part-of-speech tagging and supertag- 
ging, in this paper our goal is to improve the per- 
formance of supertagging using local techniques 
which avoid full parsing. These supertag disam- 
biguation models can be grouped into contextual 
models and class based models. Contextual mod- 
els use different features in frameworks that ex- 
ploit the information those features provide in 
order to achieve higher accuracies in supertag- 
ging. For class based models, supertags are first 
grouped into clusters and words are tagged with 
clusters of supertags. We develop several auto- 
mated clustering techniques. We then demon- 
strate that with a slight increase in supertag ambi- 
guity that supertagging accuracy can be substan- 
tially improved. 
The layout of the paper is as follows. In Sec- 
tion 2, we briefly review the task of supertagging 
and the results from previous work. In Section 3, 
we explore contextual models. In Section 4, we 
outline various class based approaches. Ideas for 
future work are presented in Section 5. Lastly, we 
188 
v 
Proceedings of EACL '99 
present our conclusions in Section 6. 
2 Supertagging 
Supertags, the primary elements of the LTAG 
formalism, attempt to localize dependencies, in- 
cluding long distance dependencies. This is ac- 
complished by grouping syntactically or semanti- 
cally dependent elements to be within the same 
structure. Thus, as seen in Figure 1, supertags 
contain more information than standard part-of- 
speech tags, and there are many more supertags 
per word than part-of-speech tags. In fact, su- 
pertag disambiguation may be characterized as 
providing an almost parse, as shown in the bottom 
part of Figure 1. 
Local statistical information, in the form of a 
trigram model based on the distribution of su- 
pertags in an LTAG parsed corpus, can be used 
to choose the most appropriate supertag for any 
given word. Joshi and Srinivas (1994) define su- 
pertagging as the process of assigning the best 
supertag to each word. Srinivas (1997b) and 
Srinivas (1997a) have tested the performance of a 
trigram model, typically used for part-of-speech 
tagging on supertagging, on restricted domains 
such as ATIS and less restricted domains such as 
Wall Street Journal (WSJ). 
In this work, we explore a variety of local 
techniques in order to improve the performance 
of supertagging. All of the models presented 
here perform smoothing using a Good-Turing dis- 
counting technique with Katz's backoff model. 
With exceptions where noted, our models were 
trained on one million words of Wall Street Jour- 
nal data and tested on 48K words. The data 
and evaluation procedure are similar to that used 
in Srinivas (1997b). The data was derived by 
mapping structural information from the Penn 
Treebank WSJ corpus into supertags from the 
XTAG grammar (The XTAG-Group (1995)) us- 
ing heuristics (Srinivas (1997a)). Using this data, 
the trigram model for supertagging achieves an 
accuracy of 91.37%, meaning that 91.37% of the 
words in the test corpus were assigned the correct 
supertag.1 
3 Contextual Models 
As noted in Srinivas (1997b), a trigram model of- 
ten fails to capture the cooccurrence dependencies 
1The supertagging accuracy of 92.2% reported 
in Srinivas (1997b) was based on a different supertag 
tagset; specifically, the supertag corpus was reanno- 
tated with detailed supertags for punctuation and 
with a different analysis for subordinating conjunc- 
tions. 
between a head and its dependents--dependents 
which might not appear within a trigram's window 
size. For example, in the sentence "Many Indians 
\]eared their country might split again" the pres- 
ence of might influences the choice of the supertag 
for \]eared, an influence that is not accounted for by 
the trigram model. As described below, we show 
that the introduction of features which take into 
account aspects of head-dependency relationships 
improves the accuracy of supertagging. 
3.1 One Pass Head Trigram Model 
In a head model, the prediction of the current su- 
pertag is conditioned not on the immediately pre- 
ceding two supertags, but on the supertags for the 
two previous head words. This model may thus 
be considered to be using a context of variable 
length. 2 The sentence "Many Indians feared their 
country might split again" shows a head model's 
strengths over the trigram model. There are at 
least two frequently assigned supertags for the 
word \]eared: a more frequent one corresponding 
to a subcategorization of NP object (as ~n of 
Figure 1) and a less frequent one to a S comple- 
ment. The supertag for the word might, highly 
probable to be modeled as an auxiliary verb in 
this case, provides strong evidence for the latter. 
Notice that might and \]eared appear within a head 
model's two head window, but not within the tri- 
gram model's two word window. We may there- 
fore expect that a head model would make a more 
accurate prediction. 
Srinivas (1997b) presents a two pass head tri- 
gram model. In the first pass, it tags words as 
either head words or non-head words. Training 
data for this pass is obtained using a head percola- 
tion table (Magerman (1995)) on bracketed Penn 
Treebank sentences. After training, head tagging 
is performed according to Equation 1, where 15 is 
the estimated probability and H(i) is a charac- 
teristic function which is true iff word i is a head 
word. 
n 
H ~ argmaxH H~(wilH(i))~(H(i)lH(i-1)H(i-2)) 
i=1 (1) 
The second pass then takes the words with this 
head information and supertags them according 
to Equation 2, where tH(io) is the supertag of the 
ePart of speech tagging models have not used heads 
in this manner to achieve variable length contexts. 
Variable length n-gram models, one of which is de- 
scribed in Niesler and Woodland (1996), have been 
used instead. 
189 
Proceedings of EACL '99 
NP A 
NP* S A 
NP VP 
V NP 
J J 
NP N 
D NP* N N* I I 
the pa~lmse h 
S S A A 
NP S 
NP NP VP V AP NP 
N ~ T NP ~ iA N 
price includes E ancillary companies 
ou 2 0 3 o~ 4 cc 5 
S S 
NP S NP S 
NP VP ~ NP VP 
~ V NP NP VP NP N 
N ~ V NP D NP* A N* E N I I 
ine/deslu I I price two ancillary companies 
°t6 c~7 h 134 cc8 
S 
NP S 
S NT VP /,~ 
NP N ~ VP ~ v Ap NP VP 
N N N* V NP ~ A V NP I I I / I 
purcha~ price includes ancillary companies 
• a9 1310 all a12 ct13 
i i i " 
s 
NP N NP NP VP NP N NP 
D NP* N N* N V NP D NP* A N ~ N I I I I I I I 
the purchase price includes two ancillary companies 
h h c¢2 C~ll ~3 ~4 a5 the purchase price includes two ancillary companies 
Figure 1: A selection of the supertags associated with each word of the sentence: the purchase price 
includes two ancillary companies 
jth head from word i. 
n 
T ,~ argmaxT ll g(wilti)~(tiItH(i,_HtH(i--2)) 
i=l (2) 
This model achieves an accuracy of 87%, lower 
than the trigram model's accuracy. 
Our current approach differs significantly. In- 
stead of having heads be defined through the use 
of the head percolation table on the Penn Tree- 
bank, we define headedness in terms of the su- 
pertags themselves. The set of supertags can nat- 
urally be partitioned into head and non-head su- 
pertags. Head supertags correspond to those that 
represent a predicate and its arguments, such as 
a3 and a7. Conversely, non-head supertags corre- 
spond to those supertags that represent modifiers 
or adjuncts, such as ~1 and ~2. 
Now, the tree that is assigned to a word during 
supertagging determines whether or not it is to 
be a head word. Thus, a simple adaptation of the 
Viterbi algorithm suffices to compute Equation 2 
in a single pass, yielding a one pass head trigram 
model. Using the same training and test data, the 
one pass head model achieved 90.75% accuracy, 
constituting a 28.8% reduction in error over the 
two pass head trigram model. This improvement 
may come from a reduction in error propagation 
or the richer context that is being used to predict 
heads. 
3.2 Mixed Head and Trigram Models 
The head mod.el skips words that it does not con- 
sider to be head words and hence may lose valu- 
able information. The lack of immediate local con- 
text hurts the head model in many cases, such as 
selection between head noun and noun modifier, 
and is a reason for its lower performance relative 
to the trigram model. Consider the phrase "..., 
or $ 2.48 a share." The word 2.48 may either be 
associated with a determiner phrase supertag (~1) 
or a noun phrase supertag (ag). Notice that 2.48 
is immediately preceded by $ which is extremely 
likely to be supertagged as a determiner phrase 
031). This is strong evidence that 2.48 should be 
supertagged as a9. A pure head model cannot 
consider this particular fact, however, because 131 
is not a head supertag. Thus, local context and 
long distance head dependency relationships are 
both important for accurate supertagging. 
A 5-gram mixed model that includes both the 
trigram and the head trigram context is one ap- 
proach to this problem. This model achieves a 
performance of 91.50%, an improvement over both 
190 
Proceedings of EACL '99 
Previous Current Next 
Context Supertag Context 
tH(i _2) tH(i _~) 
tH(i,_2) tH(i _~) 
tH(i,_2) tH(i,_~) 
tH(i _~) tLM(~ _~) 
tH(i,_l) tLM(i _l) 
tH(i.-l} tLM(i,-1) 
tH(i,o) 
tLM(~,o) 
tRM(I,o) 
tH(i,o) 
tLM(i,o) 
tRMii.o) 
tH(i, - * ) tH(i,o) 
tH(i _,) tLM(i,o) 
tH(i _2) tH(i _1) 
tH(i,_,) tH(i,o) 
tH(.,_ t) tLM(I,o) 
tH(i._ ~ ~ tRM(i,o) 
Table 1: In the 3-gram mixed model, previous con- 
ditioning context and the current supertag deter- 
ministically establish the next conditioning con- 
text. H, LM, and RM denote the entities head, 
left modifier, and right modifier, respectively. 
the trigram model and the head trigram model. 
We hypothesize that the improvement is limited 
because of a large increase in the number of pa- 
rameters to be estimated. 
As an alternative, we explore a 3-gram mixed 
model that incorporates nearly all of the relevant 
information. This mixed model may be described 
as follows. Recall that we partition the set of 
all supertags into heads and modifiers. Modifiers 
have been defined so as to share the characteristic 
that each one either modifies exactly one item to 
the right or one item to the left. Consequently, 
we further divide modifiers into left modifiers (134) 
and right modifiers. Now, instead of fixing the 
conditioning context to be either the two previous 
tags (as in the trigram model) or the two pre- 
vious head tags (as in the head trigram model) 
we allow it to vary according to the identity of 
the current tag and the previous conditioning con- 
text, as shown in Table 1. Intuitively, the mixed 
model is like the trigram model except that a mod- 
ifier tag is discarded from the conditioning context 
when it has found an object of modification. The 
mixed model achieves an accuracy of 91.79%, a 
significant improvement over both the head tri- 
gram model's and the trigram model's accuracies, 
p < 0.05. Furthermore, this mixed model is com- 
putationally more efficient as well as more accu- 
rate than the 5-gram model. 
3.3 Head Word Models 
Rather than head supertags, head words often 
seem to be more predictive of dependency rela- 
tions. Based upon this reflection, we have imple- 
mented models where head words have been used 
as features. The head word model predicts the cur- 
rent supertag based on two previous head words 
(backing off to their supertags) as shown in Equa- 
Model Context 
Trigram ti- 1 ti-2 
Head 
Trigram 
5-gram 
Mix 
3-gram 
Mix 
Head 
Word 
Mix 
Word 
tH(i,-1)tH(i,-2) 
ti-lti-2 
tH(i,--1)tH(i,-2) 
tcntzt(i,-1)tcntzt(i,-2) 
W(i,--1)W(i,-2) 
ti- 1 ti-2 
WH(i,-1)WH(i,-2) 
Accuracy 
91.37 
90.75 
91.50 
91.79 
88.16 
89.46 
Table 2: Single classifier contextual models that 
have been explored along with the contexts they 
consider and their accuracies 
tion 3. 
T ~ argmaxT rXP(wilti)p(ti\]WH(i,_l)WH(i,_2)) 
i=l (3) 
The mixed trigram and head word model takes into 
account local (supertag) context and long distance 
(head word) context. Both of these models ap- 
pear to suffer from severe sparse data problems. 
It is not surprising, then, that the head word 
model achieves an accuracy of only 88.16%, and 
the mixed trigram and head word model achieves 
an accuracy of 89.46%. We were only able to 
train the latter model with 250K of training data 
because of memory problems that were caused 
by computing the large parameter space of that 
model. 
The salient characteristics of models that have 
been discussed in this subsection are summarized 
in Table 2. 
3.4 Classifier Combination 
While the features that our new models have con- 
sidered are useful, an n-gram model that considers 
all of them would run into severe sparse data prob- 
lems. This difficulty may be surmounted through 
the use of more elaborate backoff techniques. On 
the other hand, we could consider using decision 
trees at choice points in order to decide which fea- 
tures are most relevant at each point. However, we 
have currently experimented with classifier combi- 
nation as a means of ameliorating the sparse data 
problem while making use of the feature combina- 
tions that we have introduced. 
In this approach, a selection of the discussed 
models is treated as a different classifier and is 
trained on the same data. Subsequently, each clas- 
sifter supertags the test corpus separately. Finally, 
191 
Proceedings of EACL '99 
Trigram Head Trigram Head Word 3-gram Mix Mix Word 
Trigram 91.37 91.87" 91.65 91.96 91.55 
Head Trigram 
Head Word 
3-gram Mix 
Mix Word 
90.75 90.96 
88.16 
91.95 
91.88 
91.79 
91.35" 
90.51" 
91.87 
89.46 
Table 3: Accuracies of Single Classifiers and Pairwise Combination of Classifiers. 
their predictions are combined using various vot- 
ing strategies. 
The same 1000K word test corpus is used in 
models of classifier combination as is used in pre- 
vious models. We created three distinct partitions 
of this 1000K word corpus, each partition consist- 
ing of a 900K word training corpus and a 100K 
word tune corpus. In this manner, we ended up 
with a total of 300K word tuning data. 
We consider three voting strategies suggested 
by van Halteren et al. (1998): equal vote, where 
each classifier's vote is weighted equally, overall 
accuracy, where the weight depends on the over- 
all accuracy of a classifier, and pair'wise voting. 
Pairwise voting works as follows. First, for each 
pair of classifiers a and b, the empirical prob- 
ability ~(tcorrectltctassilier_atclassiyier_b) is com- 
puted from tuning data, where tclassiyier-a and 
tct~ssiy~e~-b are classifier a's and classifier b's su- 
pertag assignment for a particular word respec- 
tively, and t .... ect is the correct supertag. Sub- 
sequently, on the test data, each classifier pair 
votes, weighted by overall accuracy, for the su- 
pertag with the highest empirical probability as 
determined in the previous step, given each indi- 
vidual classifier's guess. 
The results from these voting strategies are pos- 
itive. Equal vote yields an accuracy of 91.89%. 
Overall accuracy vote has an accuracy of 91:93%. 
Pairwise voting yields an accuracy of 92.19%, 
the highest supertagging accuracy that has been 
achieved, a 9.5% reduction in error over the tri- 
gram model. 
The table of accuracy of combinations of pairs 
of classifiers is shown in Table 3. 3 The effi- 
cacy of pairwise combination (which has signifi- 
cantly fewer parameters to estimate) in ameliorat- 
ing the sparse data problem can be seen clearly. 
For example, the accuracy of pairwise combina- 
tion of head classifier and trigram classifier ex- 
ceeds that of the 5-gram mixed model. It is also 
3Entries marked with an asterisk ("*") correspond 
to cases where the pairwise combination of classifiers 
was significantly better than either of their component 
classifiers, p < 0.05. 
marginally, but not significantly, higher than the 
3-gram mixed model. It is also notable that the 
pairwise combination of the head word classifier 
and the mix word classifier yields a significant im- 
provement over either classifier, p < 0.05, consid- 
ering the disparity between the accuracies of its 
component classifiers. 
3.5 Further Evaluation 
We also compare various models' performance 
on base-NP detection and PP attachment disam- 
biguation. The results will underscore the adroit- 
ness of the classifier combination model in using 
both local and long distance features. They will 
also show that, depending on the ultimate appli- 
cation, one model may be more appropriate than 
another model. 
A base-NP is a non-recursive NP structure 
whose detection is useful in many applications, 
such as information extraction. We extend our su- 
pertagging models to perform this task in a fash- 
ion similar to that described in Srinivas (1997b). 
Selected models have been trained on 200K words. 
Subsequently, after a model has supertagged the 
test corpus, a procedure detects base-NPs by scan- 
ning for appropriate sequences of supertags. Re- 
sults for base-NP detection are shown in Table 4. 
Note that the mixed model performs nearly as well 
as the trigram model. Note also that the head 
trigram model is outperformed by the other mod- 
els. We suspect that unlike the trigram model, the 
head model does not perform the accurate mod- 
eling of local context which is important for base- 
NP detection. 
In contrast, information about long distance de- 
pendencies are more important for the the PP at- 
tachment task. In this task, a model must de- 
cide whether a PP attaches at the NP or the VP 
level. This corresponds to a choice between two 
PP supertags: one associated with NP attach- 
ment, and another associated with VP attach- 
ment. The trigram model, head trigram model, 
3-gram mixed model, and classifier combination 
model perform at accuracies of 78.53%, 79.56%, 
80.16%, and 82.10%, respectively, on the PP at- 
192 
Proceedings of EACL '99 
Trigram 
3-gram Mix 
Head Trigram 
Classifier Combination 
Recall Precision 
93.75 93.00 
93.65 92.63 
91.17 89.72 
94.00 93.17 
Table 4: Some contextual models' results on base- 
NP chunking 
tachment task. As may be expected, the trigram 
model performs the worst on this task, presum- 
ably because it is restricted to considering purely 
local information. 
4 Class Based Models 
Contextual models tag each word with the sin- 
gle most appropriate supertag. In many applica- 
tions, however, it is sufficient to reduce ambiguity 
to a small number of supertags per word. For 
example, using traditional TAG parsing methods, 
such are described in Schabes (1990), it is ineffi- 
cient to parse with a large LTAG grammar for En- 
glish such as XTAG (The XTAG-Group (1995)). 
In these circumstances, a single word may be as- 
sociated with hundreds of supertags. Reducing 
ambiguity to some small number k, say k < 5 su- 
pertags per word 4 would accelerate parsing con- 
siderably. 5 As an alternative, once such a reduc- 
tion in ambiguity has been achieved, partial pars- 
ing or other techniques could be employed to iden- 
tify the best single supertag. These are the aims 
of class based models, which assign a small set of 
supertags to each word. It is related to work by 
Brown et al. (1992) where mutual information is 
used to cluster words into classes for language 
modeling. In our work with class based models, 
we have considered only trigram based approaches 
so far. 
4.1 Context Class Model 
One reason why the trigram model of supertag- 
ging is limited in its accuracy is because it con- 
siders only a small contextual window around 
the word to be supertagged when making its 
tagging decision. Instead of using this limited 
context to pinpoint the exact supertag, we pos- 
tulate that it may be used to predict certain 
4For example, the n-best model, described below, 
achieves 98.4% accuracy with on average 4.8 supertags 
per word. 
5An alternate approach to TAG parsing that ef- 
fectively shares the computation associated with each 
lexicalized elementary tree (supertag) is described in 
Evans and Weir (1998). It would be worth comparing 
both approaches. 
structural characteristics of the correct supertag 
with much higher accuracy. In the context class 
model, supertags that share the same character- 
istics are grouped into classes and these classes, 
rather than individual supertags, are predicted 
by a trigram model. This is reminiscent of 
Samuelsson and Reich (1999) where some part of 
speech tags have been compounded so that each 
word is deterministically in one class. 
The grouping procedure may be described as 
follows. Recall that each supertag corresponds to 
a lexicalized tree t E G, where G is a particu- 
lar LTAG. Using standard FIRST and FOLLOW 
techniques, we may associate t with FOLLOW 
and PRECEDE sets, FOLLOW(t) being the set 
of supertags that can immediately follow t and 
PRECEDE(t) being those supertags that can im- 
mediately precede t. For example, an NP tree such 
as 81 would be in the FOLLOW set of a supertag 
of a verb that subcategorizes for an NP comple- 
ment. We partition the set of all supertags into 
classes such that all of the supertags in a particu- 
lar class are associated with lexicalized trees with 
the same PRECEDE and FOLLOW sets. For in- 
stance, the supertags tx and t2 corresponding re- 
spectively to the NP and S subcategorizations of 
a verb \]eared would be associated with the same 
class. (Note that a head NP tree would be a mem- 
ber of both FOLLOW(t1) and FOLLOW(t2).) 
The context class model predicts sets of su- 
pertags for words as follows. First, the trigram 
model supertags each word wi with supertag ti 
that belongs to class Ci.6 Furthermore, using the 
training corpus, we obtain set D~ which contains 
all supertags t such that ~(wilt) > 0. The word 
wi is relabeled with the set of supertags C~ N Di. 
The context class model trades off an increased 
ambiguity of 1.65 supertags per word on average, 
for a higher 92.51% accuracy. For the purpose of 
comparison, we may compare this model against 
a baseline model that partitions the set of all su- 
pertags into classes so that all of the supertags in 
one class share the same preterminal symbol, i.e., 
they are anchored by words which share the same 
part of speech. With classes defined in this man- 
ner, call C~ the set of supertags that belong to 
the class which is associated with word w~ in the 
test corpus. We may then associate with word w~ 
the set of supertags C~ gl Di, where Di is defined 
as above. This baseline procedure yields an aver- 
6For class models, we have also exper- 
imented with a variant Where the classes 
are assigned to words through the model 
c ~ aTgmaxcl-I~=,~(w, IC~)~(C, IC~_lC,_2). In 
general, we have found this procedure to give slightly 
worse results. 
193 
Proceedings of EACL '99 
age ambiguity of 5.64 supertags per word with an 
accuracy of 97.96%. 
4.2 Confusion Class Model 
The confusion class model partitions supertags 
into classes according to an alternate procedure. 
Here, classes are derived from a confusion matrix 
analysis of errors which the trigram model makes 
while supertagging. First, the trigram model su- 
pertags a tune set. A confusion matrix is con- 
structed, recording the number of times supertag 
t~ was confused for supertag tj, or vice versa, 
in the tune set. Based on the top k pairs of 
supertags that are most confused, we construct 
classes of supertags that are confused with one 
another. For example, let tl and t2 be two PP 
supertags which modify an NP and VP respec- 
tively. The most common kind of mistake that 
the trigram model made on the tune data was to 
mistag tl as t2, and vice versa. Hence, tl and t2 
are clustered by our method into the same con- 
fusion class. The second most common mistake 
was to confuse supertags that represent verb mod- 
ifier PPs and those that represent verb argument 
PPs, while the third most common mistake was to 
confuse supertags that represent head nouns and 
noun modifiers. These, too, would form their own 
classes. 
The confusion class model predicts sets of su- 
pertags for words in a manner similar to the con- 
text class model. Unlike the context class model, 
however, in this model we have to choose k, the 
number of pairs of supertags which are extracted 
from the confusion matrix over which confusion 
classes are formed. In our experiments, we have 
found that with k = 10, k = 20, and k = 40, 
the resulting models attain 94.61% accuracy and 
1.86 tags per word, 95.76% accurate and 2.23 tags 
per word, and 97.03% accurate and 3.38 tags per 
word, respectively/ 
Results of these, as well as other models dis- 
cussed below, are plotted in Figure 2. The n-best 
model is a modification of the trigram model in 
which the n most probable supertags per word are 
chosen. The classifier union result is obtained by 
assigning a word wi a set of supertags til,.+. ,tik 
where to tij is the jth classifier's supertag assign- 
ment for word wl, the classifiers being the models 
discussed in Section 3. It achieves an accuracy of 
95.21% with 1.26 supertags per word. 
< 
980" 
99 0" 
96.0 " 
950 " 
94.0 " 
93.0" 
920" 
910" 
J / S 
I "P 3 
Ambiguity (Tags Per Word) 
0 Context 
CMss 
Confusion 
Class 
Classffmr 
Union 
-~(" N-Best 
Figure 2: Ambiguity versus Accuracy for Various 
Class Models 
5 Future Work 
We are considering extending our work in sev- 
eral directions. Srinivas (1997b) discussed a 
lightweight dependency analyzer which assigns de- 
pendencies assuming that each word has been as- 
signed a unique supertag. We are extending this 
algorithm to work with class based models which 
narrows down the number of supertags per word 
with much higher accuracy. Aside from the n- 
gram modeling that was a focus of this paper, 
we would also like to explore using other kinds 
of models, such as maximum entropy. 
6 Conclusions 
We have introduced two different kinds of models 
for the task of supertagging. Contextual mod- 
els show that features for accurate supertagging 
only produce improvements when they are appro- 
priately combined. Among these models were: a 
one pass head model that reduces propagation of 
head detection errors of previous models by using 
supertags themselves to identify heads; a mixed 
model that combines use of local and long distance 
information; and a classifier combination model 
that ameliorates the sparse data problem that is 
worsened by the introduction of many new fea- 
tures. These models achieve better supertagging 
accuracies than previously obtained. We have also 
introduced class based models which trade a slight 
increase in ambiguity for significantly higher accu- 
racy. Different class based methods are discussed, 
and the tradeoff between accuracy and ambiguity 
is demonstrated. 
7Again, for the class C assign to a given word w~, 
we consider only those tags ti E C for which/5(wdti) > 
0. 
References 
Steven Abney. 1990. Rapid Incremental parsing 
194 
Proceedings of EACL '99 
with repair. In Proceedings of the 6th New OED 
Conference: Electronic Text Research, pages 1- 
9, University of Waterloo, Waterloo, Canada. 
Hiyan Alshawi. 1996. Head automata and bilin- 
gual tiling: translation with minimal represen- 
tations. In Proceedings of the 34th Annual 
Meeting Association for Computational Lin- 
guistics, Santa Cruz, California. 
Srinivas Bangalore. 1998. Transplanting Su- 
pertags from English to Spanish. In Proceedings 
of the TAG+4 Workshop, Philadelphia, USA. 
Peter F. Brown, Vincent J. Della Pietra, Peter V. 
deSouza, Jennifer Lai, and Robert L. Mercer. 
1992. Class-based n-gram models of natural 
language Computational Linguistics, 18.4:467- 
479. 
R. Chandrasekhar and B. Srinivas. 1997. Using 
supertags in document filtering: the effect of 
increased context on information retrieval In 
Proceedings of Recent Advances in NLP '97. 
Eugene Charniak. 1996. Tree-bank Grammars. 
Technical Report CS-96-02, Brown University, 
Providence, RI. 
Michael Collins. 1996. A New Statistical Parser 
Based on Bigram Lexical Dependencies. In Pro- 
ceedings of the 3~ th Annual Meeting of the As- 
sociation for Computational Linguistics, Santa 
Cruz. 
Roger Evans and David Weir. 1998. A Structure- 
sharing Parser for Lexicalized Grammars. In 
Proceedings of the 17 eh International Confer- 
ence on Computational Linguistics and the 36 th 
Annual Meeting of the Association for Compu- 
tational Linguistics, Montreal. 
Ralph Grishman. 1995. Where's the Syntax? 
The New York University MUC-6 System. In 
Proceedings of the Sixth Message Understand- 
ing Conference, Columbia, Maryland. 
H. van Halteren, J. Zavrel, and W. Daelmans. 
1998. Improving Data Driven Wordctass Tag- 
ging by System Combination. In Proceedings of 
COLING-ACL 98, Montreal. 
Jerry R. Hobbs, Douglas E. Appelt, John 
Bear, David Israel, Andy Kehler, Megumi Ka- 
mayama, David Martin, Karen Myers, and 
Marby Tyson. 1995. SRI International FAS- 
TUS system MUC-6 test results and analy- 
sis. In Proceedings of the Sixth Message Un- 
derstanding Conference, Columbia, Maryland. 
Jerry R. Hobbs, Douglas Appelt, John Bear, 
David Israel, Megumi Kameyama, Mark Stickel, 
and Mabry Tyson. 1997. FASTUS: A Cas- 
caded Finite-State Transducer for Extracting 
Information from Natural-Language Text. In 
E. Roche and Schabes Y., editors, Finite State 
Devices for Natural Language Processing. MIT 
Press, Cambridge, Massachusetts. 
Aravind K. Joshi and B. Srinivas. 1994. Dis- 
ambiguation of Super Parts of Speech (or Su- 
pertags): Almost Parsing. In Proceedings of 
the 17 th International Conference on Com- 
putational Linguistics (COLING '9~), Kyoto, 
Japan, August. 
D. Jurafsky, Chuck Wooters, Jonathan Segal, An- 
dreas Stolcke, Eric Fosler, Gary Tajchman, and 
Nelson Morgan. 1995. Using a Stochastic CFG 
as a Language Model for Speech Recognition. 
In Proceedings, IEEE ICASSP, Detroit, Michi- 
gan. 
David M. Magerman. 1995. Statistical Decision- 
Tree Models for Parsing. In Proceedings of 
the 33 ~d Annual Meeting of the Association for 
Computational Linguistics. 
T.R. Niesler and P.C. Woodland. 1996. A 
variable-length category-based N-gram lan- 
guage model. In Proceedings, IEEE ICASSP. 
S. Roukos. 1996. Phrase structure language mod- 
els. In Proc. ICSLP '96, volume supplement, 
Philadelphia, PA, October. 
Christer Samuelsson and Wolfgang Reich. 1999. 
A Class-based Language Model for Large Vo- 
cabulary Speech Recognition Extracted from 
Part-of-Speech Statistics. In Proceedings, IEEE 
ICASSP. 
Yves Schabes. 1990. Mathematical and Computa- 
tional Aspects of Lexicalized Grammars. Ph.D. 
thesis, University of Pennsylvania, Philadel- 
phia, PA. 
B. Srinivas. 1997a. Complexity of Lexical De- 
scriptions and its Relevance to Partial Pars- 
ing. Ph.D. thesis, University of Pennsylvania, 
Philadelphia, PA, August. 
B. Srinivas. 1997b. Performance Evaluation of 
Supertagging for Partial Parsing. In Proceed- 
ings of Fifth International Workshop on Pars- 
ing Technology, Boston, USA, September. 
R. Weischedel., R. Schwartz, J. Palmucci, M. 
Meteer, and L. Ramshaw. 1993. Coping with 
ambiguity and unknown words through prob- 
abilistic models. Computational Linguistics, 
19.2:359-382. 
The XTAG-Group. 1995. A Lexicalized Tree Ad- 
joining Grammar for English. Technical Re- 
port IRCS 95-03, University of Pennsylvania, 
Philadelphia, PA. 
195 
