A SNoW based Supertagger with Application to NP Chunking
Libin Shen and Aravind K. Joshi
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104, USA
a0 libin,joshi
a1 @linc.cis.upenn.edu
Abstract
Supertagging is the tagging process of
assigning the correct elementary tree of
LTAG, or the correct supertag, to each
word of an input sentence1. In this pa-
per we propose to use supertags to expose
syntactic dependencies which are unavail-
able with POS tags. We first propose a
novel method of applying Sparse Network
of Winnow (SNoW) to sequential models.
Then we use it to construct a supertagger
that uses long distance syntactical depen-
dencies, and the supertagger achieves an
accuracy of a2a4a3a6a5a8a7a10a9a12a11 . We apply the su-
pertagger to NP chunking. The use of su-
pertags in NP chunking gives rise to al-
most a9a12a11 absolute increase (from a2a4a3a6a5a14a13a16a15a16a11
to a2a4a3a6a5a17a2a4a18a16a11 ) in F-score under Transforma-
tion Based Learning(TBL) frame. The
surpertagger described here provides an
effective and efficient way to exploit syn-
tactic information.
1 Introduction
In Lexicalized Tree-Adjoining Grammar (LTAG)
(Joshi and Schabes, 1997; XTAG-Group, 2001),
each word in a sentence is associated with an el-
ementary tree, or a supertag (Joshi and Srinivas,
1994). Supertagging is the process of assigning the
correct supertag to each word of an input sentence.
The following two facts make supertagging attrac-
tive. Firstly supertags encode much more syntac-
tical information than POS tags, which makes su-
pertagging a useful pre-parsing tool, so-called, al-
most parsing (Srinivas and Joshi, 1999). On the
1By the correct supertag we mean the supertag that an LTAG
parser would assign to a word in a sentence.
other hand, as the term ’supertagging’ suggests, the
time complexity of supertagging is similar to that of
POS tagging, which is linear in the length of the in-
put sentence.
In this paper, we will focus on the NP chunk-
ing task, and use it as an application of supertag-
ging. (Abney, 1991) proposed a two-phase pars-
ing model which includes chunking and attaching.
(Ramshaw and Marcus, 1995) approached chuck-
ing by using Transformation Based Learning(TBL).
Many machine learning techniques have been suc-
cessfully applied to chunking tasks, such as Regular-
ized Winnow (Zhang et al., 2001), SVMs (Kudo and
Matsumoto, 2001), CRFs (Sha and Pereira, 2003),
Maximum Entropy Model (Collins, 2002), Memory
Based Learning (Sang, 2002) and SNoW (Mu˜noz et
al., 1999).
The previous best result on chunking in literature
was achieved by Regularized Winnow (Zhang et al.,
2001), which took some of the parsing results given
by an English Slot Grammar-based parser as input to
the chunker. The use of parsing results contributed
a13a19a5a17a20a4a3a16a11 absolute increase in F-score. However, this
approach conflicts with the purpose of chunking.
Ideally, a chunker geneates n-best results, and an at-
tacher uses chunking results to construct a parse.
The dilemma is that syntactic constraints are use-
ful in the chunking phase, but they are unavail-
able until the attaching phase. The reason is that
POS tags are not a good labeling system to encode
enough linguistic knowledge for chunking. How-
ever another labeling system, supertagging, can pro-
vide a great deal of syntactic information.
In an LTAG, each word is associated with a set of
possible elementary trees. An LTAG parser assigns
the correct elementary tree to each word of a sen-
tence, and uses the elementary trees of all the words
to build a parse tree for the sentence. Elementary
trees, which we call supertags, contain more infor-
mation than POS tags, and they help to improve the
chunking accuracy.
Although supertags are able to encode long dis-
tance dependence, supertaggers trained with local
information in fact do not take full advantage of
complex information available in supertags.
In order to exploit syntactic dependencies in a
larger context, we propose a new model of supertag-
ging based on Sparse Network of Winnow (SNoW)
(Roth, 1998). We also propose a novel method of
applying SNoW to sequential models in a way anal-
ogous to the Projection-base Markov Model (PMM)
used in (Punyakanok and Roth, 2000). In contrast to
PMM, we construct a SNoW classifier for each POS
tag. For each word of an input sentence, its POS tag,
instead of the supertag of the previous word, is used
to select the corresponding SNoW classifier. This
method helps to avoid the sparse data problem and
forces SNoW to focus on difficult cases in the con-
text of supertagging task. Since PMM suffers from
the label bias problem (Lafferty et al., 2001), we
have used two methods to cope with this problem.
One method is to skip the local normalization step,
and the other is to combine the results of left-to-right
scan and right-to-left scan.
We test our supertagger on both the hand-coded
supertags used in (Chen et al., 1999) as well as
the supertags extracted from Penn Treebank(PTB)
(Marcus et al., 1994; Xia, 2001). On the dataset
used in (Chen et al., 1999), our supertagger achieves
an accuracy of a2a4a3a6a5a8a7a10a9a12a11 .
We then apply our supertagger to NP chunking.
The purpose of this paper is to find a better way to
exploit syntactic information which is useful in NP
chunking, but not the machine learning part. So we
just use TBL, a well-known algorithm in the com-
munity of text chunking, as the machine learning
tool in our research. Using TBL also allows us to
easily evaluate the contribution of supertags with re-
spect to Ramshaw and Marcus’s original work, the
de facto baseline of NP chunking. The use of su-
pertags with TBL can be easily extended to other
machine learning algorithms.
We repeat Ramshaw and Marcus’ Transformation
Based NP chunking (Ramshaw and Marcus, 1995)
algorithm by substituting supertags for POS tags in
the dataset. The use of supertags gives rise to almost
a9a12a11 absolute increase (from a2a4a3a6a5a14a13a16a15a16a11 to a2a4a3a6a5a17a2a4a18a16a11 ) in F-
score under Transformation Based Learning(TBL)
frame. This confirms our claim that using supertag-
ging as a labeling system helps to increase the over-
all performance of NP Chunking. The supertag-
ger presented in this paper provides an opportunity
for advanced machine learning techniques to im-
prove their performance on chunking tasks by ex-
ploiting more syntactic information encoded in the
supertags.
2 Supertagging and NP Chunking
In (Srinivas, 1997) trigram models were used for su-
pertagging, in which Good-Turing discounting tech-
nique and Katz’s back-off model were employed.
The supertag for a word was determined by the lexi-
cal preference of the word, as well as by the contex-
tual preference of the previous two supertags. The
model was tested on WSJ section 20 of PTB, and
trained on section 0 through 24 except section 20.
The accuracy on the test data is a2a19a9a21a5a17a15a16a22a4a11 2.
In (Srinivas, 1997), supertagging was used for
NP chunking and it achieved an F-score of a2a4a3a6a5a8a7a23a11 .
(Chen, 2001) reported a similar result with a tri-
gram supertagger. In their approaches, they first su-
pertagged the test data and then uesd heuristic rules
to detect NP chunks. But it is hard to say whether
it is the use of supertags or the heuristic rules that
makes their system achieve the good results.
As a first attempt, we use fast TBL (Ngai and Flo-
rian, 2001), a TBL program, to repeat Ramshaw and
Marcus’ experiment on the standard dataset. Then
we use Srinivas’ supertagger (Srinivas, 1997) to su-
pertag both the training and test data. We run the
fast TBL for the second round by using supertags in-
stead of POS tags in the dataset. With POS tags we
achieve an F-score of a2a4a3a6a5a14a13a24a9a12a11 , but with supertags we
only achieve an F-score of a2a19a9a21a5a17a20a4a20a16a11 . This is not sur-
prising becuase Srinivas’ supertag was only trained
with a trigram model. Although supertags are able
to encode long distance dependence, supertaggers
trained with local information in fact do not take full
advantage of their strong capability. So we must use
long distance dependencies to train supertaggers to
take full advantage of the information in supertags.
2This number is based on footnote 1 of (Chen et al., 1999).
A few supertags were grouped into equivalence classes for eval-
uation
The trigram model often fails in capturing the co-
occurrence dependence between a head word and
its dependents. Consider the phrase ”will join the
board as a nonexecutive director”. The occurrence
of join has influence on the lexical selection of as.
But join is outside the window of trigram. (Srini-
vas, 1997) proposed a head trigram model in which
the lexical selection of a word depended on the su-
pertags of the previous two head words , instead of
the supertags of the two words immediately leading
the word of interest. But the performance of this
model was worse than the traditional trigram model
because it discarded local information.
(Chen et al., 1999) combined the traditional tri-
gram model and head trigram model in their trigram
mixed model. In their model, context for the current
word was determined by the supertag of the previ-
ous word and context for the previous word accord-
ing to 6 manually defined rules. The mixed model
achieved an accuracy of a2a19a9a21a5a25a22a21a2a16a11 on the same dataset
as that of (Srinivas, 1997). In (Chen et al., 1999),
three other models were proposed, but the mixed
model achieved the highest accuracy. In addition,
they combined all their models with pairwise voting,
yielding an accuracy of a2a4a3a6a5a26a9a27a2a16a11 .
The mixed trigram model achieves better results
on supertagging because it can capture both lo-
cal and long distance dependencies to some extent.
However, we think that a better way to find useful
context is to use machine learning techniques but not
define the rules manually. One approach is to switch
to models like PMM, which can not only take advan-
tage of generative models with the Viterbi algorithm,
but also utilize the information in a larger contexts
through flexible feature sets. This is the basic idea
guiding the design of our supertagger.
3 SNoW
Sparse Network of Winnow (SNoW) (Roth, 1998) is
a learning architecture that is specially tailored for
learning in the presence of a very large number of
features where the decision for a single sample de-
pends on only a small number of features. Further-
more, SNoW can also be used as a general purpose
multi-class classifier.
It is noted in (Mu˜noz et al., 1999) that one of
the important properites of the sparse architecture of
SNoW is that the complexity of processing an exam-
ple depends only on the number of features active in
it, a28a30a29 , and is independent of the total number of fea-
tures, a28a32a31 , observed over the life time of the system
and this is important in domains in which the total
number of features in very large, but only a small
number of them is active in each example.
As far as supertagging is concerned, word context
forms a very large space. However, for each word in
a given sentence, only a small part of features in the
space are related to the decision on supertag. Specif-
ically the supertag of a word is determined by the ap-
pearances of certain words, POS tags, or supertags
in its context. Therefore SNoW is suitable for the
supertagging task.
Supertagging can be viewed in term of the se-
quential model, which means that the selection of
the supertag for a word is influenced by the decisions
made on the previous few words. (Punyakanok and
Roth, 2000) proposed three methods of using classi-
fiers in sequential inference, which are HMM, PMM
and CSCL. Among these three models, PMM is the
most suitable for our task. The basic idea of PMM
is as follows.
Given an observation sequence a33 , we find the
most likely state sequence a34 given a33 by maximiz-
ing
a35a37a36
a34a39a38a8a33a41a40a43a42 a44
a45
a46
a31a48a47a50a49
a35a37a36a52a51
a31a53a38
a51a16a54a27a55
a5a56a5a56a5
a55a53a51
a31a58a57
a54a59a55
a33a60a40a62a61
a35a63a54a59a36a52a51a4a54
a38a64
a54
a40
a42 a44
a45
a46
a31a48a47a50a49
a35a37a36a52a51
a31a53a38
a51
a31a58a57
a54a27a55
a64a65a31a66a40a62a61
a35a67a54a65a36a52a51a16a54
a38a64
a54
a40 (1)
In this model, the output of SNoW is used to es-
timate a35a37a36a52a51 a38
a51a27a68a69a55
a64a16a40 and a35a67a54a59a36a52a51 a38a64a16a40 , where a51 is the current
state, a51a27a68 is the previous state, and a64 is the current
observation. a35a37a36a52a51 a38a51 a68 a55 a64a16a40 is separated to many sub-
functions a35a71a70a62a72a66a36a52a51 a38a64a4a40 according to previous state a51a12a68 . In
practice, a35a73a70 a72 a36a52a51 a38a64a16a40 is estimated in a wider window
of the observed sequence, instead of a64 only. Then
the problem is how to map the SNoW results into
probabilities. In (Punyakanok and Roth, 2000), the
sigmoid a9a12a74 a36 a9a63a75a77a76 a57a32a78a79a29a53a80a52a31a58a57a24a81a50a82 a40 is defined as confidence,
where a83 is the threshold for SNoW, a84a6a85a87a86 is the dot
product of the weight vector and the example vec-
tor. The confidence is normalized by summing to 1
and used as the distribution mass a35a73a70a88a72a89a36a52a51 a38a64a16a40 .
4 Modeling Supertagging
4.1 A Novel Sequential Model with SNoW
Firstly we have to decide how to treat POS tags. One
approach is to assign POS tags at the same time that
we do supertagging. The other approach is to as-
sign POS tags with a traditional POS tagger first,
and then use them as input to the supertagger. Su-
pertagging an unknown word becomes a problem for
supertagging due to the huge size of the supertag set,
Hence we use the second approach in our paper. We
first run the Brill POS tagger (Brill, 1995) on both
the training and the test data, and use POS tags as
part of the input.
Let a90 a42 a91 a54 a91a39a49a65a5a56a5a56a5a8a91 a45 be the sentence, a92 a42
a93
a54
a93
a49a65a5a56a5a56a5
a93
a45 be the POS tags, and
a83a94a42a95a86
a54
a86a96a49a65a5a56a5a56a5a8a86
a45 be
the supertags respectively. Given a90
a55
a92 , we can find
the most likely supertag sequence a83 given a90 a55 a92 by
maximizing
a35a37a36
a83a97a38a8a90
a55
a92a60a40a98a42a99a44
a45
a46
a100
a47a50a49
a35a37a36
a86
a100
a38a86
a54a102a101a8a101a8a101
a100
a57
a54a103a55
a90
a55
a92a60a40a62a61
a35a63a54a65a36
a86
a54
a38a91
a54a27a55
a93
a54
a40
Analogous to PMM, we decompose
a35a37a36
a86
a100
a38a86
a54a102a101a8a101a8a101
a100
a57
a54a104a55
a90
a55
a92a41a40 into sub-classifiers. How-
ever, in our model, we divide it with respect to POS
tags as follows
a35a105a36
a86
a100
a38a86
a54a102a101a8a101a8a101
a100
a57
a54a103a55
a90
a55
a92a60a40a63a106
a35a108a107a66a109a110a36
a86
a100
a38a86
a54a102a101a8a101a8a101
a100
a57
a54a103a55
a90
a55
a92a60a40 (2)
There are several reasons for decomposing
a35a37a36
a86
a100
a38a86
a54a102a101a8a101a8a101
a100
a57
a54a104a55
a90
a55
a92a41a40 with respect to the POS tag of
the current word, instead of the supertag of the pre-
vious word.
a111 To avoid sparse-data problem. There are 479
supertags in the set of hand-coded supertags,
and almost 3000 supertags in the set of su-
pertags extracted from Penn Treebank.
a111 Supertags related to the same POS tag are more
difficult to distinguish than supertags related to
different POS tags. Thus by defining a clas-
sifier on the POS tag of the current word but
not the POS tag of the previous word forces the
learning algorithm to focus on difficult cases.
a111 Decomposition of the probability estimation
can decrease the complexity of the learning al-
gorithm and allows the use of different param-
eters for different POS tags.
For each POS a93 , we construct a SNoW classifier
a112a113a107 to estimate distribution a35a108a107a21a36
a86a27a38a86
a68a58a55
a90
a55
a92a60a40 accord-
ing to the previous supertags a86 a68 . Following the esti-
mation of distribution function in (Punyakanok and
Roth, 2000), we define confidence with a sigmoid
a114 a107a21a36
a86a104a38a86
a68 a55
a90
a55
a92a41a40a67a106
a9
a9a98a75a116a115a117a76
a57a32a78a79a118a120a119a121a78a122a31a88a123a31
a72a69a124a125a126a124a127
a82a58a57
a70
a82
a55 (3)
where a51 is the threshold of a112a113a107 , and a115 is set to 1.
The distribution mass is then defined with normal-
ized confidence
a35a71a107a65a36
a86a27a38a86
a68 a55
a90
a55
a92a60a40a67a106
a114 a107 a36
a86a27a38a86
a68a69a55
a90
a55
a92a60a40
a128
a31
a114 a107a4a36
a86a104a38a86
a68
a55
a90
a55
a92a41a40
(4)
4.2 Label Bias Problem
In (Lafferty et al., 2001), it is shown that PMM
and other non-generative finite-state models based
on next-state classifiers share a weakness which they
called the label bias problem: the transitions leaving
a given state compete only against each other, rather
than against all other transitions in the model. They
proposed Conditional Random Fields (CRFs) as so-
lution to this problem.
(Collins, 2002) proposed a new algorithm for pa-
rameter estimation as an alternate to CRF. The new
algorithm was similar to maximum-entropy model
except that it skipped the local normalization step.
Intuitively, it is the local normalization that makes
distribution mass of the transitions leaving a given
state incomparable with all other transitions.
It is noted in (Mu˜noz et al., 1999) that SNoW’s
output provides, in addition to the prediction, a ro-
bust confidence level in the prediction, which en-
ables its use in an inference algorithm that combines
predictors to produce a coherent inference. In that
paper, SNoW’s output is used to estimate the proba-
bility of open and close tags. In general, the proba-
bility of a tag can be estimated as follows
a35 a107 a36
a86a27a38a86
a68 a55
a90
a55
a92a60a40a63a106
a112a129a107a21a36
a86a104a38a86
a68 a55
a90
a55
a92a41a40a131a130
a51
a128
a31
a36a52a112a113a107a4a36
a86a104a38a86
a68 a55
a90
a55
a92a41a40a73a130
a51
a40
a55 (5)
as one of the anonymous reviewers has suggested.
However, this makes probabilities comparable
only within the transitions of the same history a86 a68 .
An alternative to this approach is to use the SNoW’s
output directly in the prediction combination, which
makes transitions of different history comparable,
since the SNoW’s output provides a robust confi-
dence level in the prediction. Furthermore, in order
to make sure that the confidences are not too sharp,
we use the confidence defined in (3).
In addition, we use two supertaggers, one scans
from left to right and the other scans from right to
left. Then we combine the results via pairwise vot-
ing as in (van Halteren et al., 1998; Chen et al.,
1999) as the final supertag. This approach of vot-
ing also helps to cope with the label bias problem.
4.3 Contextual Model
a35a71a107a65a36
a86a27a38a86
a68 a55
a90
a55
a92a60a40 is estimated within a 5-word window
plus two head supertags before the current word.
For each word a91 a100 , the basic features are a90 a42
a91
a100
a57a132a49
a124a101a8a101a8a101a124
a100a26a133
a49 , a92 a42
a93
a100
a57a132a49
a124a101a8a101a8a101a124
a100a26a133
a49 , a86
a68
a42 a86
a100
a57a132a49
a124
a100
a57
a54 and
a134a10a135
a57a132a49
a124
a57
a54 , the two head supertags before the current
word. Thus
a35a71a107a62a109a110a36
a86
a100
a38a86
a54a102a101a8a101a8a101
a100
a57
a54a104a55
a90
a55
a92a41a40
a42
a35a71a107a62a109a110a36
a86
a100
a38a86
a100
a57a132a49
a124
a100
a57
a54a27a55
a91
a100
a57a132a49
a101a8a101a8a101
a100a56a133
a49
a55
a93
a100
a57a132a49
a101a8a101a8a101
a100a26a133
a49
a55 a134a10a135
a57a132a49
a124
a57
a54
a40
A basic feature is called active for word a91 a100 if and
only if the corresponding word/POS-tag/supertag
appears at a specified place around a91
a100 . For our
SNoW classifiers we use unigram and bigram of ba-
sic features as our feature set. A feature defined as a
bigram of two basic features is active if and only if
the two basic features are both active. The value of
a feature of a91 a100 is set to 1 if this feature is active for
a91
a100 , or 0 otherwise.
4.4 Related Work
(Chen, 2001) implemented an MEMM model for su-
pertagging which is analogous to the POS tagging
model of (Ratnaparkhi, 1996). The feature sets used
in the MEMM model were similar to ours. In addi-
tion, prefix and suffix features were used to handle
rare words. Several MEMM supertaggers were im-
plemented based on distinct feature sets.
In (Mu˜noz et al., 1999), SNoW was used for
text chunking. The IOB tagging model in that pa-
per was similar to our model for supertagging, but
there are some differences. They did not decom-
pose the SNoW classifier with respect to POS tags.
They used two-level deterministic ( beam-width=1 )
search, in which the second level IOB classifier takes
the IOB output of the first classifier as input features.
5 Experimental Evaluation and Analysis
In our experiments, we use the default settings of
the SNoW promotion parameter, demotion parame-
ter and the threshold value given by the SNoW sys-
tem. We train our model on the training data for 2
rounds, only counting the features that appear for at
least 5 times. We skip the normalization step in test,
and we use beam search with the width of 5.
In our first experiment, we use the same dataset
as that of (Chen et al., 1999) for our experiments.
We use WSJ section 00 through 24 expect section
20 as training data, and use section 20 as test data.
Both training and test data are first tagged by Brill’s
POS tagger (Brill, 1995). We use the same pair-
wise voting algorithm as in (Chen et al., 1999). We
run supertagging on the training data and use the su-
pertagging result to generate the mapping table used
in pairwise voting.
The SNoW supertagger scanning from left to
right achieves an accuracy of a2a4a3a6a5a14a13a16a3a16a11 , and the one
scanning from right to left achieves an accuracy of
a2a19a9a21a5a8a7a136a15a16a11 . By combining the results of these two su-
pertaggers with pairwise voting, we achieve an ac-
curacy of a2a4a3a6a5a8a7a10a9a12a11 , an error reduction of a3a6a5a26a9a12a11 com-
pared to a2a4a3a6a5a17a3a4a18a16a11 , the best supertagging result to date
(Chen, 2001). Table 1 shows the comparison with
previous work.
Our algorithm, which is coded in Java, takes
about 10 minutes to supertag the test data with a
P3 1.13GHz processor. However, in (Chen, 2001),
the accuracy of a2a4a3a6a5a17a3a4a18a16a11 was achieved by a Viterbi
search program that took about 5 days to supertag
the test data. The counterpart of our algorithm in
(Chen, 2001) is the beam search on Model 8 with
width of 5, which is the same as the beam width in
our algorithm. Compared with this program, our al-
gorithm achieves an error reduction of a22a23a5a26a9a12a11 .
(Chen et al., 1999) achieved an accuracy of
a2a4a3a6a5a26a9a27a2a16a11 by combination of 5 distinct supertaggers.
However, our result is achieved by combining out-
puts of two homogeneous supertaggers, which only
differ in scan direction.
Our next experiment is with the set of supertags
abstracted from PTB with Fei Xia’s LexTract (Xia,
2001). Xia extracted an LTAG-style grammar from
PTB, and repeated Srinivas’ experiment (Srinivas,
1997) on her supertag set. There are 2920 elemen-
model acca11
Srinivas(97) trigram 91.37
Chen(99) trigram mix 91.79
Chen(99) voting 92.19
Chen(01) width=5 91.83
Chen(01) Viterbi 92.25
SNoW left-to-right 92.02
SNoW right-to-left 91.43
SNoW 92.41
Table 1: Comparison with previous work. Training
data is WSJ section 00 thorough 24 except section
20 of PTB. Test data is WSJ section 20. Size of tag
set is 479. acca11 = percentage of accuracy. The num-
ber of Srinivas(97) is based on footnote 1 of (Chen et
al., 1999). The number of Chen(01) width=5 is the
result of a beam search on Model 8 with the width
of 5.
model acca11 (22) acca11 (23)
Xia(01) trigram 83.60 84.41
SNoW left-to-right 86.01 86.27
Table 2: Results on auto-extracted LTAG grammar.
Training data is WSJ section 02 thorough 21 of PTB.
Test data is WSJ section 22 and 23. Size of supertag
set is 2920. acca11 = percentage of accuracy.
tary trees in Xia’s grammar a137a138a49 , so that the supertags
are more specialized and hence there is much more
ambiguity in supertagging. We have experimented
with our model on a137a139a49 and her dataset. We train our
left-to-right model on WSJ section 02 through 21 of
PTB, and test on section 22 and 23. We achieve an
average error reduction of a9a27a15a6a5a14a13a136a11 . The reason why
the accuracy is rather low is that systems using a137a138a49
have to cope with much more ambiguities due the
large size of the supertag set. The results are shown
in Table 2.
We test on both normalized and unnormalized
models with both hand coded supertag set and auto-
extracted supertag set. We use the left-to-right
SNoW model in these experiments. The results in
Table 3 show that skipping the local normalization
improves performance in all the systems. The ef-
fect of skipping normalization is more significant on
auto-extracted tags. We think this is because sparse
tag set size norm? acca11 (20/22/23)
auto 2920 yes NA / 85.77 / 85.98
auto 2920 no NA / 86.01 / 86.27
hand 479 yes 91.98 / NA / NA
hand 479 no 92.02 / NA / NA
Table 3: Experiments on normalized and unnormal-
ized models using left-to-right SNoW supertagger.
size = size of the tag set. norm? = normalized or
not. acca11 = percentage of accuracy on section 20,
22 and 23. auto = auto-extracted tag set. hand =
hand coded tag set.
data is more vulnerable to the label bias problem.
6 Application to NP Chunking
Now we come back to the NP chunking problem.
The standard dataset of NP chunking consists of
WSJ section 15-18 as train data and section 20 as test
data. In our approach, we substitute the supertags
for the POS tags in the dataset. The new data look
as follows.
For B Pnxs O
the B Dnx I
nine B Dnx I
months A NXN I
The first field is the word, the second is the su-
pertag of the word, and the last is the IOB tag.
We first use the fast TBL (Ngai and Florian, 2001),
a Transformation Based Learning algorithm, to re-
peat Ramshaw and Marcus’ experiment, and then
apply the same program to our new dataset. Since
section 15-18 and section 20 are in the standard data
set of NP chunking, we need to avoid using these
sections as training data for our supertagger. We
have trained another supertagger that is trained on
776K words in WSJ section 02-14 and 21-24, and it
is tuned with 44K words in WSJ section 19. We use
this supertagger to supertag section 15-18 and sec-
tion 20. We train an NP Chunker on section 15-18
with fast TBL, and test it on section 20.
There is a small problem with the supertag set that
we have been using, as far as NP chunking is con-
cerned. Two words with different POS tags may be
tagged with the same supertag. For example both de-
terminer (DT) and number (CD) can be tagged with
B Dnx. However this will be harmful in the case
model A P R F
RM95 - 91.80 92.27 92.03
Brill-POS 97.42 91.83 92.20 92.01
Tri-STAG 97.29 91.60 91.72 91.66
SNoW-STAG 97.66 92.76 92.34 92.55
SNoW-STAG2 97.70 92.86 93.05 92.95
GOLD-POS 97.91 93.17 93.51 93.34
GOLD-STAG 98.48 94.74 95.63 95.18
Table 4: Results on NP Chunking. Training data is
WSJ section 15-18 of PTB. Test data is WSJ section
20. A = Accuracy of IOB tagging. P = NP chunk
Precision. R = NP chunk Recall. F = F-score. Brill-
POS = fast TBL with Brill’s POS tags. Tri-STAG =
fast TBL with supertags given by Srinivas’ trigram-
based supertagger. SNoW-STAG = fast TBL with
supertags given by our SNoW supertagger. SNoW-
STAG2 = fast TBL with augmented supertags given
by our SNoW supertagger. GOLD-POS = fast TBL
with gold standard POS tags. GOLD-STAG = fast
TBL with gold standard supertags.
of NP Chunking. As a solution, we use augmented
supertags that have the POS tag of the lexical item
specified. An augmented supertag can also be re-
garded as concatenation of a supertag and a POS tag.
For B Pnxs(IN) O
the B Dnx(DT) I
nine B Dnx(CD) I
months A NXN(NNS) I
The results are shown in Table 4. The system
using augmented supertags achieves an F-score of
a2a4a3a6a5a17a2a4a18a16a11 , or an error reduction of a9a4a9a21a5a17a140a16a11 below the
baseline of using Brill POS tags. Although these two
systems are both trained with the same TBL algo-
rithm, we implicitly employ more linguistic knowl-
edge as the learning bias when we train the learn-
ing machine with supertags. Supertags encode more
syntactical information than POS tag do.
For example, in the sentence Three leading drug
companies ..., the POS tag of a141a52a76a12a84 a135a136a142 a28a50a143 is VBG, or
present participle. Based on the local context of
a141a52a76a27a84
a135a16a142
a28a50a143 , Three can be the subject of leading. How-
ever, the supertag of leading is B An, which repre-
sents a modifier of a noun. With this extra informa-
tion, the chunker can easily solve the ambiguity. We
find many instances like this in the test data.
It is important to note that the accuracy of su-
pertag itself is much lower than that of POS tag
while the use of supertags helps to improve the over-
all performance. On the other hand, since the accu-
racy of supertagging is rather lower, there is more
room left for improving.
If we use gold standard POS tags in the previ-
ous experiment, we can only achieve an F-score of
a2a4a15a6a5a17a15a65a7a23a11 . However, if we use gold standard supertags
in our previous experiment, the F-score is as high
as a2a4a18a6a5a26a9a27a140a16a11 . This tells us how much room there
is for further improvements. Improvements in su-
pertagging may give rise to further improvements in
chunking.
7 Conclusions
We have proposed the use of supertags in the NP
chunking task in order to use more syntactical de-
pendencies which are unavailable with POS tags. In
order to train a supertagger with a larger context, we
have proposed a novel method of applying SNoW to
the sequential model and have applied it to supertag-
ging. Our algorithm takes advantage of rich feature
sets, avoids the sparse-data problem, and forces the
learning algorithm to focus on the difficult cases.
Being aware of the fact that our algorithm may suf-
fer from the label bias problem, we have used two
methods to cope with this problem, and achieved de-
sirable results.
We have tested our algorithms on both the hand-
coded tag set used in (Chen et al., 1999) and su-
pertags extracted for Penn Treebank(PTB). On the
same dataset as that of (Chen et al., 1999), our new
supertagger achieves an accuracy of a2a4a3a6a5a8a7a10a9a12a11 . Com-
pared with the supertaggers with the same decoding
complexity (Chen, 2001), our algorithm achieves an
error reduction of a22a23a5a26a9a12a11 .
We repeat Ramshaw and Marcus’ Transforma-
tion Based NP chunking (Ramshaw and Marcus,
1995) test by substituting supertags for POS tags
in the dataset. The use of supertags in NP chunk-
ing gives rise to almost a9a12a11 absolute increase (from
a2a4a3a6a5a14a13a16a15a16a11 to a2a4a3a6a5a17a2a4a18a16a11 ) in F-score under Transformation
Based Learning(TBL) frame, or an error reduction
of a9a4a9a21a5a17a140a16a11 .
The accuracy of a2a4a3a6a5a17a2a4a18a16a11 with our individual TBL
chunker is close to results of POS-tag-based systems
using advanced machine learning algorithms, such
as a2a4a15a6a5a17a15a65a7a23a11 by voted MBL chunkers (Sang, 2002),
a2a4a3a6a5a17a140a16a11 by SNoW chunker (Mu˜noz et al., 1999). The
benefit of using a supertagger is obvious. The su-
pertagger provides an opportunity for advanced ma-
chine learning techniques to improve their perfor-
mance on chunking tasks by exploiting more syn-
tactic information encoded in the supertags.
To sum up, the supertagging algorithm presented
here provides an effective and efficient way to em-
ploy syntactic information.
Acknowledgments
We thank Vasin Punyakanok for help on the use of
SNoW in sequential inference, John Chen for help
on dataset and evaluation methods and comments
on the draft. We also thank Srinivas Bangalore and
three anonymous reviews for helpful comments.
References
S. Abney. 1991. Parsing by chunks. In Principle-Based
Parsing. Kluwer Academic Publishers.
E. Brill. 1995. Transformation-based error-driven learn-
ing and natural language processing: A case study
in part-of-speech tagging. Computational Linguistics,
21(4):543–565.
J. Chen, B. Srinivas, and K. Vijay-Shanker. 1999. New
models for improving supertag disambiguation. In
Proceedings of the 9th EACL.
J. Chen. 2001. Towards Efficient Statistical Parsing us-
ing Lexicalized Grammatical Information. Ph.D. the-
sis, University of Delaware.
M. Collins. 2002. Discriminative training methods for
hidden markov models: Theory and experiments with
perceptron algorithms. In EMNLP 2002.
A. Joshi and Y. Schabes. 1997. Tree-adjoining gram-
mars. In G. Rozenberg and A. Salomaa, editors,
Handbook of Formal Languages, volume 3, pages 69
– 124. Springer.
A. Joshi and B. Srinivas. 1994. Disambiguation of su-
per parts of speech (or supertags): Almost parsing. In
COLING’94.
T. Kudo and Y. Matsumoto. 2001. Chunking with sup-
port vector machines. In Proceedings of NAACL 2001.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Condi-
tional random fields: Probabilistic models for stgmen-
tation and labeling sequence data. In Proceedings of
ICML 2001.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz.
1994. Building a large annotated corpus of en-
glish: the penn treebank. Computational Linguistics,
19(2):313–330.
M. Mu˜noz, V. Punyakanok, D. Roth, and D. Zimak.
1999. A learning approach to shallow parsing. In Pro-
ceedings of EMNLP-WVLC’99.
G. Ngai and R. Florian. 2001. Transformation-based
learning in the fast lane. In Proceedings of NAACL-
2001, pages 40–47.
V. Punyakanok and D. Roth. 2000. The use of classifiers
in sequential inference. In NIPS’00.
L. Ramshaw and M. Marcus. 1995. Text chunking using
transformation-based learning. In Proceedings of the
3rd WVLC.
A. Ratnaparkhi. 1996. A maximum entropy part-of-
speech tagger. In Proceedings of EMNLP 96.
D. Roth. 1998. Learning to resolve natural language am-
biguities: A unified approach. In AAAI’98.
Erik F. Tjong Kim Sang. 2002. Memory-based shal-
low parsing. Journal of Machine Learning Research,
2:559–594.
F. Sha and F. Pereira. 2003. Shallow parsing with condi-
tional random fields. In Proceedings of NAACL 2003.
B. Srinivas and A. Joshi. 1999. Supertagging: An ap-
proach to almost parsing. Computational Linguistics,
25(2).
B. Srinivas. 1997. Performance evaluation of supertag-
ging for partial parsing. In IWPT 1997.
H. van Halteren, J. Zavrel, and W. Daelmans. 1998. Im-
proving data driven wordclass tagging by system com-
bination. In Proceedings of COLING-ACL 98.
F. Xia. 2001. Automatic Grammar Generation From
Two Different Perspectives. Ph.D. thesis, University
of Pennsylvania.
XTAG-Group. 2001. A lexicalized tree adjoining gram-
mar for english. Technical Report 01-03, IRCS, Univ.
of Pennsylvania.
T. Zhang, F. Damerau, and D. Johnson. 2001. Text
chunking using regularized winnow. In Proceedings
of ACL 2001.
