Text Chunking using Regularized Winnow
Tong Zhang, Fred Damerau, and David Johnson
IBM T.J. Watson Research Center
Yorktown Heights
New York, 10598, USA
tzhang@watson.ibm.com   damerau@watson.ibm.com   dejohns@us.ibm.com
Abstract
Many machine learning methods have
recently been applied to natural lan-
guage processing tasks. Among them,
the Winnow algorithm has been ar-
gued to be particularly suitable for NLP
problems, due to its robustness to ir-
relevant features. However in theory,
Winnow may not converge for non-
separable data. To remedy this prob-
lem, a modification called regularized
Winnow has been proposed. In this pa-
per, we apply this new method to text
chunking. We show that this method
achieves state of the art performance
with significantly less computation than
previous approaches.
1 Introduction
Recently there has been considerable interest in
applying machine learning techniques to prob-
lems in natural language processing. One method
that has been quite successful in many applica-
tions is the SNoW architecture (Dagan et al.,
1997; Khardon et al., 1999). This architecture
is based on the Winnow algorithm (Littlestone,
1988; Grove and Roth, 2001), which in theory
is suitable for problems with many irrelevant at-
tributes. In natural language processing, one often encounters a very high dimensional feature space in which most of the features are irrelevant. Therefore the robustness of Winnow to high dimensional feature spaces is considered an important reason why it is suitable for NLP tasks.
However, the convergence of the Winnow al-
gorithm is only guaranteed for linearly separable
data. In practical NLP applications, data are of-
ten linearly non-separable. Consequently, a di-
rect application of Winnow may lead to numer-
ical instability. A remedy for this, called regu-
larized Winnow, has been recently proposed in
(Zhang, 2001). This method modifies the origi-
nal Winnow algorithm so that it solves a regular-
ized optimization problem. It converges both in
the linearly separable case and in the linearly non-
separable case. Its numerical stability implies that
the new method can be more suitable for practical
NLP problems that may not be linearly separable.
In this paper, we compare the regularized Winnow and original Winnow algorithms on text chunking (Ab-
ney, 1991). In order for us to rigorously com-
pare our system with others, we use the CoNLL-
2000 shared task dataset (Sang and Buchholz,
2000), which is publicly available from http://lcg-
www.uia.ac.be/conll2000/chunking. An advan-
tage of using this dataset is that a large number
of state of the art statistical natural language pro-
cessing methods have already been applied to the
data. Therefore we can readily compare our re-
sults with other reported results.
We show that state of the art performance can
be achieved by using the newly proposed regu-
larized Winnow method. Furthermore, we can
achieve this result with significantly less compu-
tation than earlier systems of comparable perfor-
mance.
The paper is organized as follows. In Section 2,
we describe the Winnow algorithm and the reg-
ularized Winnow method. Section 3 describes
the CoNLL-2000 shared task. In Section 4, we
give a detailed description of our system that em-
ploys the regularized Winnow algorithm for text
chunking. Section 5 contains experimental results
for our system on the CoNLL-2000 shared task.
Some final remarks will be given in Section 6.
2 Winnow and regularized Winnow for
binary classification
We review the Winnow algorithm and the reg-
ularized Winnow method. Consider the binary
classification problem: to determine a label $y \in \{-1, +1\}$ associated with an input vector $x$. A useful method for solving this problem is through linear discriminant functions, which consist of linear combinations of the components of the input variable. Specifically, we seek a weight vector $w$ and a threshold $\theta$ such that $w^T x < \theta$ if its label $y = -1$ and $w^T x \ge \theta$ if its label $y = 1$.
For simplicity, we shall assume $\theta = 0$ in this paper. The restriction does not cause problems in practice since one can always append a constant feature to the input data $x$, which offsets the effect of $\theta$.
Given a training set of labeled data $(x^1, y^1), \ldots, (x^n, y^n)$, a number of approaches
to finding linear discriminant functions have
been advanced over the years. We are especially
interested in the Winnow multiplicative update
algorithm (Littlestone, 1988). This algorithm
updates the weight vector $w$ by going through
the training data repeatedly. It is mistake driven
in the sense that the weight vector is updated
only when the algorithm is not able to correctly
classify an example.
The Winnow algorithm (with positive weight) employs multiplicative update: if the linear discriminant function misclassifies an input training vector $x^i$ with true label $y^i$, then we update each component $j$ of the weight vector $w$ as:

$$w_j \leftarrow w_j \exp(\eta x^i_j y^i), \qquad (1)$$

where $\eta > 0$ is a parameter called the learning rate. The initial weight vector can be taken as $w_j = \mu_j > 0$, where $\mu$ is a prior which is typically chosen to be uniform.
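To make update (1) concrete, the following minimal sketch (in Python with NumPy; the function name, the dense-array representation, and the default parameters are our own choices, with the learning rate and prior matching the values reported in Section 5) runs the mistake-driven multiplicative update over a training set:

    import numpy as np

    def train_winnow(X, y, eta=0.01, mu=0.1, epochs=30):
        """Positive-weight Winnow with multiplicative update (1); a sketch only."""
        n, d = X.shape
        w = np.full(d, mu)                          # uniform prior mu_j > 0
        for _ in range(epochs):
            for i in range(n):
                # mistake driven: update only when example i is misclassified
                # (the threshold is assumed to be 0, as in the text)
                if y[i] * np.dot(w, X[i]) <= 0:
                    w *= np.exp(eta * y[i] * X[i])  # w_j <- w_j * exp(eta * x^i_j * y^i)
        return w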
There can be several variants of the Winnow algorithm. One is called balanced Winnow, which is equivalent to an embedding of the input space into a higher dimensional space as $\tilde{x} = [x, -x]$. This modification allows the positive weight Winnow algorithm for the augmented input $\tilde{x}$ to have the effect of both positive and negative weights for the original input $x$.
One problem of the Winnow online update al-
gorithm is that it may not converge when the data
are not linearly separable. One may partially rem-
edy this problem by decreasing the learning rate
parameter $\eta$ during the updates. However, this is rather ad hoc since it is unclear what the best way to do so is. Therefore, in practice it can be quite difficult to implement this idea properly.
In order to obtain a systematic solution to this
problem, we shall first examine a derivation of
the Winnow algorithm in (Gentile and Warmuth,
1998), which motivates a more general solution to
be presented later.
Following (Gentile and Warmuth, 1998), we consider the loss function $\max(-w^T x^i y^i, 0)$, which is often called "hinge loss". For each data point $(x^i, y^i)$, we consider an online update rule such that the weight $w^{i+1}$ after seeing the $i$-th example is given by the solution to

$$\min_{w^{i+1}} \left[ \sum_j w^{i+1}_j \ln \frac{w^{i+1}_j}{w^i_j} + \eta \, \max\!\left(-(w^{i+1})^T x^i y^i, \, 0\right) \right]. \qquad (2)$$
Setting the gradient of the above formula to zero, we obtain

$$\ln \frac{w^{i+1}_j}{w^i_j} + \eta \nabla_j L = 0, \qquad (3)$$

where $\nabla_j L$ denotes the $j$-th component of the gradient (or more rigorously, a subgradient) of $\max(-(w^{i+1})^T x^i y^i, 0)$ with respect to $w^{i+1}$, which takes the value $0$ if $(w^{i+1})^T x^i y^i > 0$, the value $-x^i_j y^i$ if $(w^{i+1})^T x^i y^i < 0$, and a value in between if $(w^{i+1})^T x^i y^i = 0$. The Winnow update (1) can be regarded as an approximate solution to (3).
Although the above derivation does not solve
the non-convergence problem of the original Win-
now method when the data are not linearly sepa-
rable, it does provide valuable insights which can
lead to a more systematic solution of the problem.
The basic idea was given in (Zhang, 2001), where
the original Winnow algorithm was converted into
a numerical optimization problem that can handle
linearly non-separable data.
The resulting formulation is closely related to
(2). However, instead of looking at one example
at a time as in an online formulation, we incorpo-
rate all examples at the same time. In addition,
we add a margin condition into the “hinge loss”.
Specifically, we seek a linear weight $\hat{w}$ that solves

$$\min_{w} \left[ \sum_j w_j \ln \frac{w_j}{\mu_j} + C \sum_{i=1}^{n} \max\!\left(1 - w^T x^i y^i, \, 0\right) \right],$$

where $C > 0$ is a given parameter called the regularization parameter. The optimal solution $\hat{w}$ of the above optimization problem can be derived from the solution $\hat{\alpha}$ of the following dual optimization problem:

$$\hat{\alpha} = \max_{\alpha} \left[ \sum_i \alpha_i - \sum_j \mu_j \exp\!\left( \sum_i \alpha_i x^i_j y^i \right) \right]$$
$$\text{s.t.} \quad \alpha_i \in [0, C] \quad (i = 1, \ldots, n).$$

The $j$-th component of $\hat{w}$ is given by

$$\hat{w}_j = \mu_j \exp\!\left( \sum_{i=1}^{n} \hat{\alpha}_i x^i_j y^i \right).$$
A Winnow-like update rule can be derived for the dual regularized Winnow formulation. At each data point $(x^i, y^i)$, we fix all $\alpha_k$ with $k \ne i$, and update $\alpha_i$ to approximately maximize the dual objective functional using gradient ascent:

$$\alpha_i \leftarrow \max\!\left(\min\!\left(C, \; \alpha_i + \eta (1 - w^T x^i y^i)\right), \, 0\right), \qquad (4)$$

where $w_j = \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$. We update $\alpha$ and $w$ by repeatedly going over the data from $i = 1, \ldots, n$.
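For concreteness, a minimal sketch of this training loop is given below (Python/NumPy; the function name, the dense-array representation, and the default parameters are our own choices, with $C$, the learning rate, and the prior matching the settings reported in Section 5). It maintains the relation $w_j = \mu_j \exp(\sum_i \alpha_i x^i_j y^i)$ incrementally while applying update (4):

    import numpy as np

    def train_regularized_winnow(X, y, C=1.0, eta=0.01, mu=0.1, epochs=30):
        """Dual regularized Winnow, update rule (4); a sketch, not the authors' code."""
        n, d = X.shape
        alpha = np.zeros(n)
        w = np.full(d, mu)                      # with alpha = 0, w_j = mu_j
        for _ in range(epochs):
            for i in range(n):
                margin = y[i] * np.dot(w, X[i])
                # gradient ascent on alpha_i, projected onto [0, C]
                new_alpha = min(C, max(0.0, alpha[i] + eta * (1.0 - margin)))
                delta = new_alpha - alpha[i]
                if delta != 0.0:
                    # keep w_j = mu_j * exp(sum_i alpha_i x^i_j y^i) up to date
                    w *= np.exp(delta * y[i] * X[i])
                    alpha[i] = new_alpha
        return w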
Learning bounds of regularized Winnow that
are similar to the mistake bound of the original
Winnow have been given in (Zhang, 2001). These
results imply that the new method, while it can
properly handle non-separable data, shares simi-
lar theoretical advantages of Winnow in that it is
also robust to irrelevant features. This theoretical
insight implies that the algorithm is suitable for
NLP tasks with large feature spaces.
3 CoNLL-2000 chunking task
The text chunking task is to divide text into
syntactically related non-overlapping groups of
words (chunks). It is considered an important
problem in natural language processing. As an
example of text chunking, the sentence “Balcor,
which has interests in real estate, said the posi-
tion is newly created.” can be divided as follows:
[NP Balcor], [NP which] [VP has] [NP inter-
ests] [PP in] [NP real estate], [VP said] [NP the
position] [VP is newly created].
In this example, NP denotes noun phrase, VP
denotes verb phrase, and PP denotes prepositional
phrase.
The CoNLL-2000 shared task (Sang and Buch-
holz, 2000), introduced last year, is an attempt
to set up a standard dataset so that researchers
can compare different statistical chunking meth-
ods. The data are extracted from sections of the
Penn Treebank. The training set consists of WSJ
sections 15-18 of the Penn Treebank, and the test set consists of WSJ section 20. Additionally, a
part-of-speech (POS) tag was assigned to each to-
ken by a standard POS tagger (Brill, 1994) that
was trained on the Penn Treebank. These POS
tags can be used as features in a machine learn-
ing based chunking algorithm. See Section 4 for
details.
The data contains eleven different chunk types.
However, except for the most frequent three
types: NP (noun phrase), VP (verb phrase), and
PP (prepositional phrase), each of the remaining
chunks has less than 5% occurrences. The chunks
are represented by the following three types of
tags:
B-X first word of a chunk of type X
I-X non-initial word in an X chunk
O word outside of any chunk
A standard software program has been
provided (which is available from http://lcg-
www.uia.ac.be/conll2000/chunking) to compute
the performance of each algorithm. For each
chunk, three figures of merit are computed:
precision (the percentage of detected phrases that
are correct), recall (the percentage of phrases in
the data that are found), and the $F_{\beta=1}$ metric, which is the harmonic mean of the precision and the recall. The overall precision, recall and $F_{\beta=1}$ metric on all chunks are also computed. The overall $F_{\beta=1}$ metric gives a single number that can be used to compare different algorithms.
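As a small illustration of the metric (this is not the official CoNLL evaluation script; the function name and the counts are hypothetical):

    def chunk_f1(num_correct, num_predicted, num_gold):
        """Precision, recall, and F_{beta=1} from chunk counts; a sketch."""
        precision = num_correct / num_predicted if num_predicted else 0.0
        recall = num_correct / num_gold if num_gold else 0.0
        if precision + recall == 0.0:
            return precision, recall, 0.0
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
        return precision, recall, f1

    # hypothetical counts, not results from the shared task:
    # chunk_f1(9300, 9940, 9950) -> (0.9356, 0.9347, 0.9351)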
4 System description
4.1 Encoding of basic features
An advantage of regularized Winnow is its robust-
ness to irrelevant features. We can thus include as
many features as possible, and let the algorithm
itself find the relevant ones. This strategy ensures
that we do not miss any features that are impor-
tant. However, using more features requires more
memory and slows down the algorithm. There-
fore in practice it is still necessary to limit the
number of features used.
Let $tok_{-c}, tok_{-c+1}, \ldots, tok_0, \ldots, tok_{c-1}, tok_c$ be a string of tokenized text (each token is a word or punctuation). We want to predict the chunk type of the current token $tok_0$. For each word $tok_i$, we let $pos_i$ denote the associated POS tag, which is assumed to be given in the CoNLL-2000 shared task. The following is a list of the features we use as input to the regularized Winnow (where we choose $c = 2$):
- first order features: $tok_i$ and $pos_i$ ($i = -c, \ldots, c$)
- second order features: $pos_i \times pos_j$ ($i, j = -c, \ldots, c$, $i < j$), and $pos_i \times tok_j$ ($i = -c, \ldots, c$; $j = -1, 0, 1$)
In addition, since in a sequential process, the predicted chunk tags $t_i$ for $tok_i$ are available for $i < 0$, we include the following extra chunk type features:
- first order chunk-type features: $t_i$ ($i = -c, \ldots, -1$)
- second order chunk-type features: $t_i \times t_j$ ($i, j = -c, \ldots, -1$, $i < j$), and POS-chunk interactions $t_i \times pos_j$ ($i = -c, \ldots, -1$; $j = -c, \ldots, c$).
For each data point (corresponding to the current token $tok_0$), the associated features are encoded as a binary vector $x$, which is the input to Winnow. Each component of $x$ corresponds to a possible value $v$ of a feature $f$ in one of the above feature lists. The value of the component corresponds to a test which has value one if the corresponding feature $f$ achieves value $v$, or value zero if the corresponding feature $f$ achieves another feature value.
For example, since $pos_0$ is in our feature list, each possible POS value $v$ of $pos_0$ corresponds to a component of $x$: the component has value one if $pos_0 = v$ (the feature value represented by the component is active), and value zero otherwise. Similarly, for a second order feature in our feature list such as $pos_0 \times pos_1$, each possible value $v_0 \times v_1$ in the set $\{pos_0 \times pos_1\}$ is represented by a component of $x$: the component has value one if $pos_0 = v_0$ and $pos_1 = v_1$ (the feature value represented by the component is active), and value zero otherwise. The same encoding is applied to all other first order and second order features, with each possible test of "feature = feature value" corresponding to a unique component in $x$.
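A sketch of how the active components of $x$ can be generated is given below (Python; the function name, the feature-name scheme, the padding token, and the helper pad() are our own illustrative choices, not the authors' implementation):

    def encode_features(toks, poss, prev_tags, k, c=2):
        """Return the active "feature = value" tests for position k; a sketch."""
        def pad(seq, i, unk="_NULL_"):
            return seq[i] if 0 <= i < len(seq) else unk

        feats = []
        # first order: tokens and POS tags in a window of +-c around position k
        for i in range(-c, c + 1):
            feats.append(f"tok[{i}]={pad(toks, k + i)}")
            feats.append(f"pos[{i}]={pad(poss, k + i)}")
        # second order: POS-POS pairs (i < j) and POS-token pairs (j in -1, 0, 1)
        for i in range(-c, c + 1):
            for j in range(i + 1, c + 1):
                feats.append(f"pos[{i}]|pos[{j}]={pad(poss, k + i)}|{pad(poss, k + j)}")
            for j in (-1, 0, 1):
                feats.append(f"pos[{i}]|tok[{j}]={pad(poss, k + i)}|{pad(toks, k + j)}")
        # chunk-type features from already-predicted tags at positions k-c .. k-1
        for i in range(-c, 0):
            feats.append(f"tag[{i}]={pad(prev_tags, k + i)}")
            for j in range(i + 1, 0):
                feats.append(f"tag[{i}]|tag[{j}]={pad(prev_tags, k + i)}|{pad(prev_tags, k + j)}")
            for j in range(-c, c + 1):
                feats.append(f"tag[{i}]|pos[{j}]={pad(prev_tags, k + i)}|{pad(poss, k + j)}")
        return feats   # each distinct string corresponds to one binary component of x

Each returned string names one "feature = feature value" test; the binary vector $x$ has a one in exactly those components.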
Clearly, in this representation, the high order
features are conjunction features that become ac-
tive when all of their components are active. In
principle, one may also consider disjunction fea-
tures that become active when some of their com-
ponents are active. However, such features are
not considered in this work. Note that the above
representation leads to a sparse, but very large di-
mensional vector. This explains why we do not
include all possible second order features since
this will quickly consume more memory than we
can handle.
Also, the above list of features is not necessarily the best available. We only included the
most straight-forward features and pair-wise fea-
ture interactions. One might try even higher order
features to obtain better results.
Since Winnow is relatively robust to irrelevant
features, it is usually helpful to provide the algo-
rithm with as many features as possible, and let
the algorithm pick up relevant ones. The main
problem that prohibits us from using more fea-
tures in the Winnow algorithm is memory con-
sumption (mainly in training). The time complex-
ity of the Winnow algorithm does not depend on
the number of features, but rather on the average
number of non-zero features per datum, which is
usually quite small.
Due to the memory problem, in our implemen-
tation we have to limit the number of token fea-
tures (words or punctuation) to 5000: we sort the tokens by their frequencies in the training set from high frequency to low frequency; we then treat tokens of rank 5000 or higher as the same token. Since the number 5000 is still reasonably large,
this restriction is relatively minor.
There are possible remedies to the memory
consumption problem, although we have not im-
plemented them in our current system. One so-
lution comes from noticing that although the fea-
ture vector is of very high dimension, most di-
mensions are empty. Therefore one may create a
hash table for the features, which can significantly
reduce the memory consumption.
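A minimal sketch of this remedy (not implemented in the paper; the hash function choice and bucket count are arbitrary illustrative assumptions):

    import zlib

    def hashed_index(feature_string, num_buckets=2 ** 20):
        """Map a "feature = value" string to a fixed-size index range; a sketch."""
        return zlib.crc32(feature_string.encode("utf-8")) % num_buckets

    # The sparse input to Winnow then becomes a set of hashed component indices:
    # x_indices = {hashed_index(f) for f in encode_features(toks, poss, prev_tags, k)}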
4.2 Using enhanced linguistic features
We were interested in determining if additional
features with more linguistic content would lead
to even better performance. The ESG (English
Slot Grammar) system in (McCord, 1989) is not
directly comparable to the phrase structure gram-
mar implicit in the WSJ treebank. ESG is a de-
pendency grammar in which each phrase has a
head and dependent elements, each marked with
a syntactic role. ESG normally produces multiple
parses for a sentence, but has the capability, which
we used, to output only the highest ranked parse,
where rank is determined by a system-defined
measure.
There are a number of incompatibilities be-
tween the treebank and ESG in tokenization,
which had to be compensated for in order to trans-
fer the syntactic role features to the tokens in the
standard training and test sets. We also trans-
ferred the ESG part-of-speech codes (different
from those in the WSJ corpus) and made an at-
tempt to attach B-PP, B-NP and I-NP tags as in-
ferred from the ESG dependency structure. In the
end, the latter two tags did not prove useful. ESG
is also very fast, parsing several thousand sen-
tences on an IBM RS/6000 in a few minutes of
clock time.
It might seem odd to use a parser output as in-
put to a machine learning system to find syntactic
chunks. As noted above, ESG or any other parser
normally produces many analyses, whereas in the
kind of applications for which chunking is used,
e.g., information extraction, only one solution is
normally desired. In addition, due to many in-
compatibilities between ESG and WSJ treebank,
less than 80% of ESG generated syntactic role
tags are in agreement with WSJ chunks. How-
ever, the ESG syntactic role tags can be regarded
as features in a statistical chunker. Another view
is that the statistical chunker can be regarded as
a machine learned transformation that maps ESG
syntactic role tags into WSJ chunks.
We denote by $f_i$ the syntactic role tag associated with token $tok_i$. Each tag takes one of 138 possible values. The following features are added to our system.
- first order features: $f_i$ ($i = -c, \ldots, c$)
- second order features: self interactions $f_i \times f_j$ ($i, j = -c, \ldots, c$, $i < j$), and interactions with POS tags $f_i \times pos_j$ ($i, j = -c, \ldots, c$).
4.3 Dynamic programming
In text chunking, we predict hidden states (chunk
types) based on a sequence of observed states
(text). This resembles hidden Markov models
where dynamic programming has been widely
employed. Our approach is related to ideas de-
scribed in (Punyakanok and Roth, 2001). Similar
methods have also appeared in other natural lan-
guage processing systems (for example, in (Ku-
doh and Matsumoto, 2000)).
Given input vectors $x$ consisting of features constructed as above, we apply the regularized Winnow algorithm to train linear weight vectors. Since the Winnow algorithm only produces positive weights, we employ the balanced version of Winnow with $x$ being transformed into $\tilde{x} = [x, -1, -x, 1]$. As explained earlier, the constant term is used to offset the effect of the threshold $\theta$. Once a weight vector $\tilde{w} = [w^+, \theta^+, w^-, \theta^-]$ is obtained, we let $w = w^+ - w^-$ and $\theta = \theta^+ - \theta^-$. The prediction with an incoming feature vector $x$ is then $s(w, x) = s(\tilde{w}, \tilde{x}) = w^T x - \theta$.
Since Winnow only solves binary classification problems, we train one linear classifier for each chunk type. In this way, we obtain twenty-three linear classifiers, one for each chunk type $t$. Denote by $w^t$ the weight associated with type $t$; then a straightforward method to classify an incoming datum is to assign the chunk tag as the one with the highest score $s(w^t, x)$.
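A sketch of this prediction step is given below (Python/NumPy; the layout of the balanced weight vector follows the ordering $[w^+, \theta^+, w^-, \theta^-]$ given above, and the function names are our own):

    import numpy as np

    def score(w_tilde, x):
        """s(w, x) = w^T x - theta recovered from the balanced weights; a sketch."""
        d = (len(w_tilde) - 2) // 2
        w_plus, theta_plus = w_tilde[:d], w_tilde[d]
        w_minus, theta_minus = w_tilde[d + 1:2 * d + 1], w_tilde[2 * d + 1]
        w = w_plus - w_minus
        theta = theta_plus - theta_minus
        return float(np.dot(w, x) - theta)

    def classify_unconstrained(weights_by_type, x):
        """Pick the chunk type with the highest score, ignoring sequence constraints."""
        return max(weights_by_type, key=lambda t: score(weights_by_type[t], x))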
However, there are constraints in any valid se-
quence of chunk types: if the current chunk is of
type I-X, then the previous chunk type can only be
either B-X or I-X. This constraint can be exploited to improve chunking performance. We denote by $\Gamma$ the set of all valid chunk sequences (that is, the sequence satisfies the above chunk type constraint).
Let $tok_1, \ldots, tok_m$ be the sequence of tokenized text for which we would like to find the associated chunk types. Let $x_1, \ldots, x_m$ be the associated feature vectors for this text sequence. Let $t_1, \ldots, t_m$ be a sequence of potential chunk types that is valid: $\{t_1, \ldots, t_m\} \in \Gamma$. In our system, we find the sequence of chunk types that has the highest value of overall truncated score:

$$\{\hat{t}_1, \ldots, \hat{t}_m\} = \arg\max_{\{t_1, \ldots, t_m\} \in \Gamma} \sum_{i=1}^{m} s_T(w^{t_i}, x_i),$$

where

$$s_T(w^{t_i}, x_i) = \min\!\left(1, \, \max\!\left(-1, \, s(w^{t_i}, x_i)\right)\right).$$

The truncation onto the interval $[-1, 1]$ is to make sure that no single point contributes too much in the summation.
The optimization problem

$$\max_{\{t_1, \ldots, t_m\} \in \Gamma} \sum_{i=1}^{m} s_T(w^{t_i}, x_i)$$

can be solved by using dynamic programming. We build a table of all chunk types for every token $tok_i$. For each fixed chunk type $t_{k+1}$, we define a value

$$S(t_{k+1}) = \max_{\{t_1, \ldots, t_k, t_{k+1}\} \in \Gamma} \sum_{i=1}^{k+1} s_T(w^{t_i}, x_i).$$
It is easy to verify that we have the following recursion:

$$S(t_{k+1}) = s_T(w^{t_{k+1}}, x_{k+1}) + \max_{\{t_k, t_{k+1}\} \in \Gamma} S(t_k). \qquad (5)$$

We also assume the initial condition $S(t_0) = 0$ for all $t_0$. Using this recursion, we can iterate over $k = 0, 1, \ldots, m$, and compute $S(t_{k+1})$ for each potential chunk type $t_{k+1}$.
Observe that in (5), $x_{k+1}$ depends on the previous chunk types $\hat{t}_k, \ldots, \hat{t}_{k+1-c}$ (where $c = 2$). In our implementation, these chunk types used to create the current feature vector $x_{k+1}$ are determined as follows. We let $\hat{t}_k = \arg\max_{t_k} S(t_k)$, and let $\hat{t}_{k-i} = \arg\max_{t_{k-i}: \{t_{k-i}, \hat{t}_{k-i+1}\} \in \Gamma} S(t_{k-i})$ for $i = 1, \ldots, c$.
After the computation of all $S(t_k)$ for $k = 0, 1, \ldots, m$, we determine the best sequence $\{\hat{t}_1, \ldots, \hat{t}_m\}$ as follows. We assign $\hat{t}_m$ to the chunk type with the largest value of $S(t_m)$. Each chunk type $\hat{t}_{m-1}, \ldots, \hat{t}_1$ is then determined from the recursion (5) as $\hat{t}_k = \arg\max_{t_k: \{t_k, \hat{t}_{k+1}\} \in \Gamma} S(t_k)$.
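The following sketch implements recursion (5) and the backtracking step (Python; the data structures and names are our own, and for brevity the per-position scores are assumed to be precomputed, whereas the actual system recomputes $x_{k+1}$ from the chunk types chosen so far, as described above):

    def decode(scores, valid_prev):
        """Dynamic-programming decoding of recursion (5); a sketch.

        scores[k][t]  : truncated score s_T(w^t, x_{k+1}) of chunk type t at position k
        valid_prev[t] : chunk types allowed immediately before type t
                        (e.g. I-NP may only follow B-NP or I-NP)
        """
        m = len(scores)
        types = list(scores[0].keys())
        S = [{t: 0.0 for t in types}]          # initial condition S(t_0) = 0
        back = []
        for k in range(m):
            S_k, back_k = {}, {}
            for t in types:
                allowed = valid_prev[t] if k > 0 else types
                best_prev = max(allowed, key=lambda p: S[k][p])
                S_k[t] = scores[k][t] + S[k][best_prev]
                back_k[t] = best_prev
            S.append(S_k)
            back.append(back_k)
        # backtrack from the best final chunk type
        tags = [max(types, key=lambda t: S[m][t])]
        for k in range(m - 1, 0, -1):
            tags.append(back[k][tags[-1]])
        return list(reversed(tags))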
5 Experimental results
Experimental results reported in this section were
obtained by using $C = 1$ and a uniform prior of $\mu_j = 0.1$. We let the learning rate $\eta = 0.01$, and
ran the regularized Winnow update formula (4)
repeatedly thirty times over the training data. The
algorithm is not very sensitive to these parame-
ter choices. Some other aspects of the system
design (such as dynamic programming, features
used, etc) have more impact on the performance.
However, due to the limitation of space, we will
not discuss their impact in detail.
Table 1 gives results obtained with the basic features. This representation gives a total number of $3.8 \times 10^6$ binary features. However, the number of non-zero features per datum is 48, which determines the time complexity of our system. The training time on a 400MHz Pentium machine running Linux is about sixteen minutes, which corresponds to less than one minute per category. The time using the dynamic programming to produce chunk predictions, excluding tokenization, is less than ten seconds. There are about $7 \times 10^4$ non-zero linear weight components per chunk-type, which corresponds to a sparsity of more than 98%. Most features are thus irrelevant.
All previous systems achieving a similar performance are significantly more complex. For example, the previous best result in the literature was achieved by a combination of 231 kernel support vector machines (Kudoh and Matsumoto, 2000) with an overall $F_{\beta=1}$ value of 93.48. Each kernel support vector machine is computationally significantly more expensive than a corresponding Winnow classifier, and they use an order of magnitude more classifiers. This implies that their system should be orders of magnitude more expensive than ours. This point can be verified from their training time of about one day on a 500MHz Linux machine. The previously second best system was a combination of five different WPDV models, with an overall $F_{\beta=1}$ value of 93.32 (van Halteren, 2000). This system is again more complex than the regularized Winnow approach we propose (their best single classifier performance is $F_{\beta=1} = 92.47$). The third best performance was achieved by using combinations of memory-based models, with an overall $F_{\beta=1}$ value of 92.50. The rest of the eleven reported systems employed a variety of statistical techniques such as maximum entropy, Hidden Markov models, and transformation based rule learners. Interested readers are referred to the summary paper (Sang and Buchholz, 2000), which contains the references to all systems being tested.
test data   precision   recall   $F_{\beta=1}$
ADJP        79.45       72.37    75.75
ADVP        81.46       80.14    80.79
CONJP       45.45       55.56    50.00
INTJ        100.00      50.00    66.67
LST         0.00        0.00     0.00
NP          93.86       93.95    93.90
PP          96.87       97.76    97.31
PRT         80.85       71.70    76.00
SBAR        87.10       87.10    87.10
VP          93.69       93.75    93.72
all         93.53       93.49    93.51

Table 1: Our chunk prediction results: with basic features
The above comparison implies that the regularized Winnow approach achieves state of the art performance with significantly less computation. The success of this method relies on regularized Winnow's ability to tolerate irrelevant features. This allows us to use a very large feature space and let the algorithm pick the relevant ones. In
addition, the algorithm presented in this paper is
simple. Unlike some other approaches, there is
little ad hoc engineering tuning involved in our
system. This simplicity allows other researchers
to reproduce our results easily.
In Table 2, we report the results of our system
with the basic features enhanced by using ESG
syntactic roles, showing that using more linguis-
tic features can enhance the performance of the
system. In addition, since regularized Winnow is
able to pick up relevant features automatically, we
can easily integrate different features into our sys-
tem in a systematic way without concerning our-
selves with the semantics of the features. The re-
sulting overall $F_{\beta=1}$ value of 94.13 is appreciably better than any previous system. The overall complexity of the system is still quite reasonable. The total number of features is about $4.2 \times 10^6$, with 88 non-zero features for each data point. The training time is about thirty minutes, and the number of non-zero weight components per chunk-type is about $8 \times 10^4$.
test data   precision   recall   $F_{\beta=1}$
ADJP        82.22       72.83    77.24
ADVP        81.06       81.06    81.06
CONJP       50.00       44.44    47.06
INTJ        100.00      50.00    66.67
LST         0.00        0.00     0.00
NP          94.45       94.36    94.40
PP          97.64       98.07    97.85
PRT         80.41       73.58    76.85
SBAR        91.17       88.79    89.96
VP          94.31       94.59    94.45
all         94.24       94.01    94.13

Table 2: Our chunk prediction results: with enhanced features
It is also interesting to compare the regularized
Winnow results with those of the original Win-
now method. We only report results with the ba-
sic linguistic features in Table 3. In this exper-
iment, we use the same setup as in the regular-
ized Winnow approach. We start with a uniform
prior of a63
a48
a25a210a34
a41a75a9 , and let the learning rate be
a59
a25a211a34
a41
a34
a9 . The Winnow update (1) is performed
thirty times repeatedly over the data. The training
time is about sixteen minutes, which is approxi-
mately the same as that of the regularized Win-
now method.
Clearly the regularized Winnow method has indeed enhanced the performance of the original Winnow method. The improvement is more or less consistent over all chunk types. It can also be seen that the improvement is not dramatic. This is not too surprising since the data are very close to linearly separable. Even on the test set, the multi-class classification accuracy is around 96%. On average, the binary classification accuracy on the training set (note that we train one binary classifier for each chunk type) is close to 100%. This means that the training data are close to linearly separable. Since the benefit of regularized Winnow is more significant with noisy data, the improvement in this case is not dramatic. We shall mention that for some other, noisier problems we have tested, the improvement of the regularized Winnow method over the original Winnow method can be much more significant.
test data   precision   recall   $F_{\beta=1}$
ADJP        73.54       71.69    72.60
ADVP        80.83       78.41    79.60
CONJP       54.55       66.67    60.00
INTJ        100.00      50.00    66.67
LST         0.00        0.00     0.00
NP          93.36       93.52    93.44
PP          96.83       97.11    96.97
PRT         83.13       65.09    73.02
SBAR        82.89       86.92    84.85
UCP         0.00        0.00     0.00
VP          93.32       93.24    93.28
all         92.77       92.93    92.85

Table 3: Chunk prediction results using original Winnow (with basic features)
6 Conclusion
In this paper, we described a text chunking sys-
tem using regularized Winnow. Since regularized
Winnow is robust to irrelevant features, we can
construct a very high dimensional feature space
and let the algorithm pick up the important ones.
We have shown that state of the art performance
can be achieved by using this approach. Further-
more, the method we propose is computationally
more efficient than all other systems reported in
the literature that achieved performance close to
ours. Our system is also relatively simple and does not involve much engineering tuning. This
means that it will be relatively easy for other re-
searchers to implement and reproduce our results.
Furthermore, the success of regularized Winnow
in text chunking suggests that the method might
be applicable to other NLP problems where it is
necessary to use large feature spaces to achieve
good performance.

References

S. P. Abney. 1991. Parsing by chunks. In R. C.
Berwick, S. P. Abney, and C. Tenny, editors,
Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278. Kluwer, Dordrecht.

Eric Brill. 1994. Some advances in rule-based part of
speech tagging. In Proc. AAAI 94, pages 722-727.

I. Dagan, Y. Karov, and D. Roth. 1997. Mistake-driven learning in text categorization. In Proceedings of the Second Conference on EMNLP.

C. Gentile and M. K. Warmuth. 1998. Linear hinge
loss and average margin. In Proc. NIPS '98.

A. Grove and D. Roth. 2001. Linear concepts and
hidden variables. Machine Learning, 42:123-141.

R. Khardon, D. Roth, and L. Valiant. 1999. Relational
learning for NLP using linear threshold elements.
In Proceedings IJCAI-99.

Taku Kudoh and Yuji Matsumoto. 2000. Use of support vector learning for chunk identification. In
Proc. CoNLL-2000 and LLL-2000, pages 142-144.

N. Littlestone. 1988. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2:285-318.

Michael McCord. 1989. Slot grammar: a system for
simple construction of practical natural language
grammars. Natural Language and Logic, pages
118-145.

Vasin Punyakanok and Dan Roth. 2001. The use
of classifiers in sequential inference. In Todd K.
Leen, Thomas G. Dietterich, and Volker Tresp, ed-
itors, Advances in Neural Information Processing
Systems 13, pages 995-1001. MIT Press.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. CoNLL-2000 and LLL-2000, pages 127-132.

Hans van Halteren. 2000. Chunking with WPDV models. In Proc. CoNLL-2000 and LLL-2000, pages 154-156.

Tong Zhang. 2001. Regularized Winnow methods.
In Advances in Neural Information Processing Sys-
tems 13, pages 703-709.
