Ranking Algorithms for Named–Entity Extraction:
Boosting and the Voted Perceptron
Michael Collins
AT&T Labs-Research, Florham Park, New Jersey.
mcollins@research.att.com
Abstract
This paper describes algorithms which
rerank the top N hypotheses from a
maximum-entropy tagger, the applica-
tion being the recovery of named-entity
boundaries in a corpus of web data. The
first approach uses a boosting algorithm
for ranking problems. The second ap-
proach uses the voted perceptron algo-
rithm. Both algorithms give compara-
ble, significant improvements over the
maximum-entropy baseline. The voted
perceptron algorithm can be considerably
more efficient to train, at some cost in
computation on test examples.
1 Introduction
Recent work in statistical approaches to parsing and
tagging has begun to consider methods which in-
corporate global features of candidate structures.
Examples of such techniques are Markov Random
Fields (Abney 1997; Della Pietra et al. 1997; John-
son et al. 1999), and boosting algorithms (Freund et
al. 1998; Collins 2000; Walker et al. 2001). One
appeal of these methods is their flexibility in incor-
porating features into a model: essentially any fea-
tures which might be useful in discriminating good
from bad structures can be included. A second ap-
peal of these methods is that their training criterion
is often discriminative, attempting to explicitly push
the score or probability of the correct structure for
each training sentence above the score of competing
structures. This discriminative property is shared by
the methods of (Johnson et al. 1999; Collins 2000),
and also the Conditional Random Field methods of
(Lafferty et al. 2001).
In a previous paper (Collins 2000), a boosting al-
gorithm was used to rerank the output from an ex-
isting statistical parser, giving significant improve-
ments in parsing accuracy on Wall Street Journal
data. Similar boosting algorithms have been applied
to natural language generation, with good results, in
(Walker et al. 2001). In this paper we apply rerank-
ing methods to named-entity extraction. A state-of-
the-art (maximum-entropy) tagger is used to gener-
ate 20 possible segmentations for each input sen-
tence, along with their probabilities. We describe
a number of additional global features of these can-
didate segmentations. These additional features are
used as evidence in reranking the hypotheses from
the max-ent tagger. We describe two learning algo-
rithms: the boosting method of (Collins 2000), and a
variant of the voted perceptron algorithm, which was
initially described in (Freund & Schapire 1999). We
applied the methods to a corpus of over one million
words of tagged web data. The methods give signif-
icant improvements over the maximum-entropy tag-
ger (a 17.7% relative reduction in error-rate for the
voted perceptron, and a 15.6% relative improvement
for the boosting method).
One contribution of this paper is to show that ex-
isting reranking methods are useful for a new do-
main, named-entity tagging, and to suggest global
features which give improvements on this task. We
should stress that another contribution is to show
that a new algorithm, the voted perceptron, gives
very credible results on a natural language task. It is
an extremely simple algorithm to implement, and is
very fast to train (the testing phase is slower, but by
no means sluggish). It should be a viable alternative
to methods such as the boosting or Markov Random
Field algorithms described in previous work.
2 Background
2.1 The data
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 489-496.
Over a period of a year or so we have had over one
million words of named-entity data annotated. The
data is drawn from web pages, the aim being to support
a question-answering system over web data. A
number of categories are annotated: the usual peo-
ple, organization and location categories, as well as
less frequent categories such as brand-names, scien-
tific terms, event titles (such as concerts) and so on.
From this data we created a training set of 53,609
sentences (1,047,491 words), and a test set of 14,717
sentences (291,898 words).
The task we consider is to recover named-entity
boundaries. We leave the recovery of the categories
of entities to a separate stage of processing.1 We
evaluate different methods on the task through precision
and recall. If a method proposes $p$ entities on
the test set, and $c$ of these are correct (i.e., an entity is
marked by the annotator with exactly the same span
as that proposed) then the precision of a method is
$100\% \times c/p$. Similarly, if $g$ is the total number of
entities in the human annotated version of the test set,
then the recall is $100\% \times c/g$.
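The exact-span scoring just described can be sketched as follows. This is a minimal illustration, not the evaluation script used in the paper: the function name and the (sentence id, start, end) span encoding are assumptions made for the example.

```python
def precision_recall(proposed, gold):
    """Return (precision, recall) in percent under exact-span matching.

    `proposed` and `gold` are collections of hypothetical
    (sentence_id, start, end) tuples; a proposed entity counts as
    correct only if the identical span appears in the gold set.
    """
    proposed, gold = set(proposed), set(gold)
    c = len(proposed & gold)  # number of correct proposals
    precision = 100.0 * c / len(proposed) if proposed else 0.0
    recall = 100.0 * c / len(gold) if gold else 0.0
    return precision, recall


gold = {(0, 11, 12), (0, 15, 20), (0, 26, 28)}
proposed = {(0, 11, 12), (0, 15, 19), (0, 26, 28), (0, 3, 3)}
p, r = precision_recall(proposed, gold)
```

Here two of the four proposals match gold spans exactly, so a span that is off by one word earns no partial credit.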
2.2 The baseline tagger
The problem can be framed as a tagging task – to
tag each word as being either the start of an entity,
a continuation of an entity, or not to be part of an
entity at all (we will use the tags S, C and N respec-
tively for these three cases). As a baseline model
we used a maximum entropy tagger, very similar to
the ones described in (Ratnaparkhi 1996; Borthwick
et al. 1998; McCallum et al. 2000). Max-ent taggers
have been shown to be highly competitive on a
number of tagging tasks, such as part-of-speech tagging
(Ratnaparkhi 1996), named-entity recognition
(Borthwick et al. 1998), and information extraction
tasks (McCallum et al. 2000). Thus the maximum-entropy
tagger we used represents a serious baseline
for the task. We used the following features (several
of the features were inspired by the approach
of (Bikel et al. 1999), an HMM model which gives
excellent results on named entity extraction):
• The word being tagged, the previous word, and
the next word.
• The previous tag, and the previous two tags (bigram
and trigram features).
• A compound feature of three fields: (a) Is the
word at the start of a sentence? (b) Does the word
occur in a list of words which occur more frequently
as lower case rather than upper case words in a large
corpus of text? (c) The type of the first letter $x$ of
the word, where $type(x)$ is defined as ‘A’ if $x$ is a
capitalized letter, ‘a’ if $x$ is a lower-case letter, ‘0’
if $x$ is a digit, and $x$ otherwise. For example, if the
word Animal is seen at the start of a sentence, and
it occurs in the list of frequent lower-cased words,
then it would be mapped to the feature 1-1-A.
• The word with each character mapped to its
$type$. For example, G.M. would be mapped to
A.A., and Animal would be mapped to Aaaaaa.
• The word with each character mapped to its
type, but repeated consecutive character types are
not repeated in the mapped string. For example,
Animal would be mapped to Aa, G.M. would again be
mapped to A.A..
1In initial experiments, we found that forcing the tagger to
recover categories as well as the segmentation, by exploding the
number of tags, reduced performance on the segmentation task,
presumably due to sparse data problems.
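The character-type features in the list above can be sketched in a few lines. The function names are illustrative, and the way the lower-case word list is consulted (a set of lower-cased strings) is an assumption; the paper does not specify the lookup.

```python
def char_type(x):
    """type(x): 'A' for upper case, 'a' for lower case, '0' for a digit, else x itself."""
    if x.isupper():
        return "A"
    if x.islower():
        return "a"
    if x.isdigit():
        return "0"
    return x


def word_shape(word):
    """Map every character of the word to its type."""
    return "".join(char_type(ch) for ch in word)


def collapsed_shape(word):
    """Like word_shape, but runs of the same type are written once."""
    out = []
    for t in word_shape(word):
        if not out or out[-1] != t:
            out.append(t)
    return "".join(out)


def compound_feature(word, sentence_initial, lowercase_lexicon):
    """Three fields: sentence-initial flag, lexicon flag, type of the first letter."""
    in_lexicon = word.lower() in lowercase_lexicon  # assumed lookup convention
    return f"{int(sentence_initial)}-{int(in_lexicon)}-{char_type(word[0])}"
```

With these definitions, the examples from the text come out as stated: Animal maps to Aaaaaa and Aa, G.M. maps to A.A. in both cases, and a sentence-initial Animal found in the word list yields 1-1-A.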
The tagger was applied and trained in the same
way as described in (Ratnaparkhi 1996). The feature
templates described above are used to create a set of
$m$ binary features $f_i(t, h)$, where $t$ is the tag, and $h$
is the “history”, or context. An example is
$$f_{100}(t, h) = \begin{cases} 1 & \text{if } t = \text{S and the word being tagged = “Mr.”} \\ 0 & \text{otherwise} \end{cases}$$
The parameters of the model are $\lambda_i$ for $i = 1 \ldots m$,
defining a conditional distribution over the tags
given a history $h$ as
$$P(t \mid h) = \frac{e^{\sum_i \lambda_i f_i(t, h)}}{\sum_{t'} e^{\sum_i \lambda_i f_i(t', h)}}$$
The parameters are trained using Generalized Iter-
ative Scaling. Following (Ratnaparkhi 1996), we
only include features which occur 5 times or more
in training data. In decoding, we use a beam search
to recover 20 candidate tag sequences for each sen-
tence (the sentence is decoded from left to right,
with the top 20 most probable hypotheses being
stored at each point).
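A small sketch may help make the decoding step concrete. It computes the conditional tag distribution of a log-linear model and runs a left-to-right beam search that keeps the top hypotheses at each position. The feature function, the weights, and all names are toy assumptions for illustration; training by Generalized Iterative Scaling is not shown.

```python
import math


def tag_distribution(weights, features, history, tags=("S", "C", "N")):
    """P(t | h): normalized exponential of the summed weights of active features."""
    scores = {t: sum(weights.get(f, 0.0) for f in features(t, history)) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: v / total for t, v in exps.items()}


def beam_search(words, weights, features, beam=20):
    """Decode left to right, keeping the `beam` most probable tag prefixes."""
    hyps = [((), 0.0)]  # (tag sequence so far, log probability)
    for i in range(len(words)):
        extended = []
        for tags, logp in hyps:
            dist = tag_distribution(weights, features, (words, i, tags))
            for t, p in dist.items():
                extended.append((tags + (t,), logp + math.log(p)))
        extended.sort(key=lambda h: h[1], reverse=True)
        hyps = extended[:beam]
    return hyps


def toy_features(tag, history):
    """Hypothetical feature templates: current word and previous tag, each paired with the tag."""
    words, i, prev_tags = history
    prev = prev_tags[-1] if prev_tags else "START"
    return [f"word={words[i]}&tag={tag}", f"prev={prev}&tag={tag}"]


toy_weights = {"word=a&tag=N": 2.0, "word=Gen&tag=S": 2.0,
               "word=Xer&tag=C": 1.5, "prev=S&tag=C": 1.0}
candidates = beam_search(["a", "Gen", "Xer"], toy_weights, toy_features, beam=20)
```

Each returned hypothesis carries its log probability, which is the quantity the reranking stage later uses as the base-model score.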
2.3 Applying the baseline tagger
As a baseline we trained a model on the full 53,609
sentences of training data, and decoded the 14,717
sentences of test data. This gave 20 candidates per
test sentence, along with their probabilities. The
baseline method is to take the most probable candi-
date for each test data sentence, and then to calculate
precision and recall figures. Our aim is to come up
with strategies for reranking the test data candidates,
in such a way that precision and recall are improved.
In developing a reranking strategy, the 53,609
sentences of training data were split into a 41,992
sentence training portion, and an 11,617 sentence development
set. The training portion was split into
5 sections, and in each case the maximum-entropy
tagger was trained on 4/5 of the data, then used to
decode the remaining 1/5. The top 20 hypotheses
under a beam search, together with their log prob-
abilities, were recovered for each training sentence.
In a similar way, a model trained on the 41,992 sen-
tence set was used to produce 20 hypotheses for each
sentence in the development set.
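The cross-decoding procedure above can be sketched generically. The callables `train_fn` and `decode_fn` are hypothetical stand-ins for the tagger's training and beam-search routines, and assigning sentences to sections by index modulo the number of folds is an assumption; the paper does not say how the five sections were formed.

```python
def jackknife_candidates(sentences, train_fn, decode_fn, folds=5, top_n=20):
    """Decode every sentence with a model that never saw it during training."""
    out = [None] * len(sentences)
    for k in range(folds):
        held_out = [i for i in range(len(sentences)) if i % folds == k]
        train = [sentences[i] for i in range(len(sentences)) if i % folds != k]
        model = train_fn(train)  # trained on the other 4/5 of the data
        for i in held_out:
            out[i] = decode_fn(model, sentences[i], top_n)
    return out
```

Decoding held-out data in this way keeps the candidate lists for training sentences comparable in quality to those the tagger produces on unseen text.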
3 Global features
3.1 The global-feature generator
The module we describe in this section generates
global features for each candidate tagged sequence.
As input it takes a sentence, along with a proposed
segmentation (i.e., an assignment of a tag for each
word in the sentence). As output, it produces a set
of feature strings. We will use the following tagged
sentence as a running example in this section:
Whether/N you/N ’/N re/N an/N aging/N flower/N child/N
or/N a/N clueless/N Gen/S Xer/C ,/N “/N The/S Day/C
They/C Shot/C John/C Lennon/C ,/N ”/N playing/N at/N the/N
Dougherty/S Arts/C Center/C ,/N entertains/N the/N
imagination/N ./N
An example feature type is simply to list the full
strings of entities that appear in the tagged input. In
this example, this would give the three features
WE=Gen_Xer
WE=The_Day_They_Shot_John_Lennon
WE=Dougherty_Arts_Center
Here WE stands for “whole entity”. Throughout
this section, we will write the features in this format.
The start of the feature string indicates the feature
type (in this case WE), followed by =. Following the
type, there are generally 1 or more words or other
symbols, which we will separate with the symbol _.
A separate module in our implementation
takes the strings produced by the global-feature
generator, and hashes them to integers. For example,
suppose the three strings WE=Gen_Xer,
WE=The_Day_They_Shot_John_Lennon, and
WE=Dougherty_Arts_Center were hashed
to 100, 250, and 500 respectively. Conceptually,
the candidate $x$ is represented by a large number
of features $h_s(x)$ for $s = 1 \ldots m$ where $m$ is the
number of distinct feature strings in training data.
In this example, only $h_{100}(x)$, $h_{250}(x)$ and $h_{500}(x)$
take the value $1$, all other features being zero.
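The two stages described here, string generation and mapping to integers, might look like the sketch below. It is illustrative only: the underscore separator, the function names, and the use of a growing dictionary in place of a hash function are assumptions.

```python
def entities(tags):
    """Inclusive (start, end) spans of entities in a sequence of S/C/N tags."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t != "C" and start is not None:  # the current entity has ended
            spans.append((start, i - 1))
            start = None
        if t == "S":
            start = i
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


def we_features(words, tags):
    """One whole-entity (WE) feature string per entity in the candidate."""
    return ["WE=" + "_".join(words[s:e + 1]) for s, e in entities(tags)]


class FeatureIndex:
    """Maps feature strings to integer ids; a dictionary stands in for hashing."""

    def __init__(self):
        self.ids = {}

    def lookup(self, feature, grow=True):
        if feature not in self.ids and grow:
            self.ids[feature] = len(self.ids) + 1
        return self.ids.get(feature)

    def sparse_vector(self, features, grow=True):
        """Ids of the features that take value 1; all other features are zero."""
        return {i for i in (self.lookup(f, grow) for f in features) if i is not None}


tagged = ("a/N clueless/N Gen/S Xer/C ,/N “/N The/S Day/C They/C Shot/C John/C "
          "Lennon/C ,/N ”/N playing/N at/N the/N Dougherty/S Arts/C Center/C ,/N")
words, tags = zip(*(tok.rsplit("/", 1) for tok in tagged.split()))
feats = we_features(words, tags)
index = FeatureIndex()
active = index.sparse_vector(feats)
```

Storing only the ids of active features keeps each candidate's representation small even when the number of distinct feature strings is very large.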
3.2 Feature templates
We now introduce some notation with which to de-
scribe the full set of global features. First, we as-
sume the following primitives of an input candidate:
• $t_i$ for $i = 1 \ldots n$ is the $i$’th tag in the tagged
sequence.
• $w_i$ for $i = 1 \ldots n$ is the $i$’th word.
• $l_i$ for $i = 1 \ldots n$ is $1$ if $w_i$ begins with a lower-case
letter, $0$ otherwise.
• $f_i$ for $i = 1 \ldots n$ is a transformation of $w_i$,
where the transformation is applied in the same
way as the final feature type in the maximum
entropy tagger. Each character in the word is
mapped to its $type$, but repeated consecutive
character types are not repeated in the mapped
string. For example, Animal would be mapped
to Aa in this feature, G.M. would again be
mapped to A.A..
• $g_i$ for $i = 1 \ldots n$ is the same as $f_i$, but has
an additional flag appended. The flag indicates
whether or not the word appears in a dictionary
of words which appeared more often
lower-cased than capitalized in a large corpus
of text. In our example, Animal appears in the
lexicon, but G.M. does not, so the two values
for $g_i$ would be Aa1 and A.A.0 respectively.
In addition, $t_i$, $w_i$, $f_i$ and $g_i$ are all defined to be
NULL if $i < 1$ or $i > n$.
Most of the features we describe are anchored on
entity boundaries in the candidate segmentation. We
will use “feature templates” to describe the features
that we used. As an example, suppose that an entity
Description | Feature Template
The whole entity string | WE=$w_s\ w_{(s+1)} \ldots w_e$
The $f_i$ features within the entity | FF=$f_s\ f_{(s+1)} \ldots f_e$
The $g_i$ features within the entity | GF=$g_s\ g_{(s+1)} \ldots g_e$
The last word in the entity | LW=$w_e$
Indicates whether the last word is lower-cased | LWLC=$l_e$
Bigram boundary features of the words before/after the start of the entity | BO00=$w_{(s-1)}\ w_s$; BO01=$w_{(s-1)}\ g_s$; BO10=$g_{(s-1)}\ w_s$; BO11=$g_{(s-1)}\ g_s$
Bigram boundary features of the words before/after the end of the entity | BE00=$w_e\ w_{(e+1)}$; BE01=$w_e\ g_{(e+1)}$; BE10=$g_e\ w_{(e+1)}$; BE11=$g_e\ g_{(e+1)}$
Trigram boundary features of the words before/after the start of the entity (16 features total, only 4 shown) | TO000=$w_{(s-2)}\ w_{(s-1)}\ w_s$ ... TO111=$g_{(s-2)}\ g_{(s-1)}\ g_s$; TO2000=$w_{(s-1)}\ w_s\ w_{(s+1)}$ ... TO2111=$g_{(s-1)}\ g_s\ g_{(s+1)}$
Trigram boundary features of the words before/after the end of the entity (16 features total, only 4 shown) | TE000=$w_{(e-1)}\ w_e\ w_{(e+1)}$ ... TE111=$g_{(e-1)}\ g_e\ g_{(e+1)}$; TE2000=$w_{(e-2)}\ w_{(e-1)}\ w_e$ ... TE2111=$g_{(e-2)}\ g_{(e-1)}\ g_e$
Prefix features | PF=$f_s$; PF2=$g_s$; PF=$f_s\ f_{(s+1)}$; PF2=$g_s\ g_{(s+1)}$; ... PF=$f_s\ f_{(s+1)} \ldots f_e$; PF2=$g_s\ g_{(s+1)} \ldots g_e$
Suffix features | SF=$f_e$; SF2=$g_e$; SF=$f_e\ f_{(e-1)}$; SF2=$g_e\ g_{(e-1)}$; ... SF=$f_e\ f_{(e-1)} \ldots f_s$; SF2=$g_e\ g_{(e-1)} \ldots g_s$
Figure 1: The full set of entity-anchored feature templates. One of these features is generated for each entity
seen in a candidate. We take the entity to span words $s \ldots e$ inclusive in the candidate.
is seen from words $s$ to $e$ inclusive in a segmentation.
Then the WE feature described in the previous
section can be generated by the template
WE=$w_s\ w_{s+1} \ldots w_e$
Applying this template to the three entities in the
running example generates the three feature strings
described in the previous section. As another example,
consider the template FF=$f_s\ f_{s+1} \ldots f_e$. This
will generate a feature string for each of the entities
in a candidate, this time using the values $f_s \ldots f_e$
rather than $w_s \ldots w_e$. For the full set of feature
templates that are anchored around entities, see figure 1.
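To show how a template turns into concrete strings, here is a sketch that instantiates a handful of the entity-anchored templates for one entity. Only a subset of Figure 1 is covered. The underscore separator, the lexicon lookup on lower-cased words, and the function names are assumptions made for the example.

```python
def collapsed_shape(word):
    """Map characters to A/a/0 types and merge repeated consecutive types."""
    out = []
    for ch in word:
        t = "A" if ch.isupper() else "a" if ch.islower() else "0" if ch.isdigit() else ch
        if not out or out[-1] != t:
            out.append(t)
    return "".join(out)


def primitives(words, lexicon):
    """Lists w, l, f, g indexed 1..n, padded with NULL at positions 0 and n+1."""
    pad = "NULL"
    w = [pad] + list(words) + [pad]
    l = [pad] + [str(int(x[0].islower())) for x in words] + [pad]
    f = [pad] + [collapsed_shape(x) for x in words] + [pad]
    g = [pad] + [collapsed_shape(x) + str(int(x.lower() in lexicon)) for x in words] + [pad]
    return w, l, f, g


def entity_features(words, lexicon, s, e):
    """Some Figure 1 templates for an entity spanning words s..e (1-based, inclusive)."""
    w, l, f, g = primitives(words, lexicon)
    join = "_".join
    return [
        "WE=" + join(w[s:e + 1]),
        "FF=" + join(f[s:e + 1]),
        "GF=" + join(g[s:e + 1]),
        "LW=" + w[e],
        "LWLC=" + l[e],
        "BO00=" + join([w[s - 1], w[s]]),
        "BO01=" + join([w[s - 1], g[s]]),
        "BO10=" + join([g[s - 1], w[s]]),
        "BO11=" + join([g[s - 1], g[s]]),
        "BE00=" + join([w[e], w[e + 1]]),
        "BE11=" + join([g[e], g[e + 1]]),
    ]


words = ["a", "clueless", "Gen", "Xer", ","]
feats = entity_features(words, lexicon={"a", "clueless"}, s=3, e=4)
```

The NULL padding means boundary templates remain defined for entities at the very start or end of a sentence.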
A second set of feature templates is anchored
around quotation marks. In our corpus, entities (typ-
ically with long names) are often seen surrounded
by quotes. For example, “The Day They Shot John
Lennon”, the name of a band, appears in the running
example. Define $s$ to be the index of any double quotation
marks in the candidate, $e$ to be the index of the
next (matching) double quotation marks if they appear
in the candidate. Additionally, define $e2$ to be
the index of the last word beginning with a lower
case letter, upper case letter, or digit within the
quotation marks. The first set of feature templates tracks
the values of $g_i$ for the words within quotes:2
Q=$g_s\ t_s\ g_{(s+1)}\ t_{(s+1)} \ldots g_e\ t_e$
Q2=$g_{(s-1)}\ t_{(s-1)}\ g_s\ t_s \ldots g_{(e+1)}\ t_{(e+1)}$
2We only included these features if $(e - s)$ was within a
fixed bound (the relation and value are illegible in this copy), to prevent
an explosion in the length of feature strings.
The next set of feature templates are sensitive
to whether the entire sequence between quotes is
tagged as a named entity. Define $F2$ to be $1$ if
$t_{s+1} =$ S, and $t_i =$ C for $i = s+2 \ldots e2$ (i.e.,
$F2 = 1$
if the sequence of words within the quotes is tagged
as a single entity). Also define $U$ to be the number
of upper cased words within the quotes, $L$ to be the
number of lower case words, and $F$ to be $1$ if $U \geq L$,
$0$ otherwise. Then two other templates are:
QF=$F2\ U\ L\ g_{(s+1)}\ g_{e2}$
QF2=$F2\ F\ g_{(s+1)}\ g_{e2}$
In the “The Day They Shot John Lennon” example
we would have $F2 = 1$ provided that the entire sequence
within quotes was tagged as an entity. Additionally,
$U = 6$, $L = 0$, and $F = 1$. The values
for $g_{(s+1)}$ and $g_{e2}$ would be Aa1 and Aa0 (these
features are derived from The and Lennon, which
respectively do and don’t appear in the capitalization
lexicon). This would give QF=1_6_0_Aa1_Aa0 and
QF2=1_1_Aa1_Aa0.
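The quote-anchored QF and QF2 templates can be traced on the running example with the sketch below. It handles only the first quoted stretch, uses 0-based positions, and judges a word as upper or lower cased by its first character; those choices, the underscore separator, and the lexicon lookup are assumptions for illustration.

```python
def collapsed_shape(word):
    """Map characters to A/a/0 types and merge repeated consecutive types."""
    out = []
    for ch in word:
        t = "A" if ch.isupper() else "a" if ch.islower() else "0" if ch.isdigit() else ch
        if not out or out[-1] != t:
            out.append(t)
    return "".join(out)


def quote_features(words, tags, lexicon):
    """QF and QF2 feature strings for the first pair of double quotation marks."""
    marks = [i for i, w in enumerate(words) if w in ('"', "“", "”")]
    if len(marks) < 2:
        return []
    s, e = marks[0], marks[1]
    inner = [i for i in range(s + 1, e) if words[i][0].isalnum()]
    if not inner:
        return []
    e2 = inner[-1]  # last quoted word starting with a letter or digit
    f2 = int(tags[s + 1] == "S" and all(tags[i] == "C" for i in range(s + 2, e2 + 1)))
    upper = sum(words[i][0].isupper() for i in range(s + 1, e))
    lower = sum(words[i][0].islower() for i in range(s + 1, e))
    f = int(upper >= lower)

    def g(i):
        return collapsed_shape(words[i]) + str(int(words[i].lower() in lexicon))

    return [
        "QF=" + "_".join([str(f2), str(upper), str(lower), g(s + 1), g(e2)]),
        "QF2=" + "_".join([str(f2), str(f), g(s + 1), g(e2)]),
    ]


words = ["“", "The", "Day", "They", "Shot", "John", "Lennon", ",", "”"]
tags = ["N", "S", "C", "C", "C", "C", "C", "N", "N"]
features = quote_features(words, tags, lexicon={"the", "day", "they", "shot"})
```

On this input the sketch reproduces the two strings given in the text, and changing any tag inside the quotes to N flips the first field to 0.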
At this point, we have fully described the repre-
sentation used as input to the reranking algorithms.
The maximum-entropy tagger gives 20 proposed
segmentations for each input sentence. Each candidate
$x$ is represented by the log probability $L(x)$
from the tagger, as well as the values of the global
features $h_s(x)$ for $s = 1 \ldots m$. In the next section
we describe algorithms which blend these two
sources of information, the aim being to improve
upon a strategy which just takes the candidate from
the tagger with the highest score for $L(x)$.
4 Ranking Algorithms
4.1 Notation
This section introduces notation for the reranking
task. The framework is derived by the transforma-
tion from ranking problems to a margin-based clas-
sification problem in (Freund et al. 1998). It is also
related to the Markov Random Field methods for
parsing suggested in (Johnson et al. 1999), and the
boosting methods for parsing in (Collins 2000). We
consider the following set-up:
• Training data is a set of example input/output
pairs. In tagging we would have training examples
$\{s_i, t_i\}$ where each $s_i$ is a sentence and each $t_i$ is the
correct sequence of tags for that sentence.
• We assume some way of enumerating a set of
candidates for a particular sentence. We use $x_{i,j}$ to
denote the $j$’th candidate for the $i$’th sentence in
training data, and $C(s_i) = \{x_{i,1}, x_{i,2} \ldots\}$ to denote
the set of candidates for $s_i$. In this paper, the top $N$
outputs from a maximum entropy tagger are used as
the set of candidates.
• Without loss of generality we take $x_{i,1}$ to be the
candidate for $s_i$ which has the most correct tags, i.e.,
is closest to being correct.3
• $Q(x_{i,j})$ is the probability that the base model
assigns to $x_{i,j}$. We define $L(x_{i,j}) = \log Q(x_{i,j})$.
• We assume a set of $m$ additional features, $h_s(x)$
for $s = 1 \ldots m$. The features could be arbitrary
functions of the candidates; our hope is to include
features which help in discriminating good candidates
from bad ones.
• Finally, the parameters of the model are a vector
of $m + 1$ parameters, $\bar{w} = \{w_0, w_1 \ldots w_m\}$. The
ranking function is defined as
$$F(x, \bar{w}) = w_0 L(x) + \sum_{s=1}^{m} w_s h_s(x)$$
This function assigns a real-valued number to a candidate
$x$. It will be taken to be a measure of the
plausibility of a candidate, higher scores meaning
higher plausibility. As such, it assigns a ranking to
different candidate structures for the same sentence,
and in particular the output on a training or test example
$s$ is $\arg\max_{x \in C(s)} F(x, \bar{w})$. In this paper we
take the features $h_s$ to be fixed, the learning problem
being to choose a good setting for the parameters $\bar{w}$.
In some parts of this paper we will use vector
notation. Define $\bar{h}(x)$ to be the vector
$\{L(x), h_1(x) \ldots h_m(x)\}$. Then the ranking score
can also be written as $F(x, \bar{w}) = \bar{w} \cdot \bar{h}(x)$ where
$\bar{a} \cdot \bar{b}$ is the dot product between vectors $\bar{a}$ and
$\bar{b}$.
3In the event that multiple candidates get the same, highest
score, the candidate with the highest value of log-likelihood $L$
under the baseline model is taken as $x_{i,1}$.
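As a concrete reading of the ranking function, the sketch below scores candidates represented as a log probability plus a set of active binary feature ids, and returns the index of the top-ranked one. The data layout (a weight dictionary with key 0 for the base-model weight) and the numbers are assumptions chosen for the example.

```python
def rank_score(w, log_prob, active):
    """Base-model weight times log probability, plus the weights of active binary features."""
    return w[0] * log_prob + sum(w.get(s, 0.0) for s in active)


def rerank(w, candidates):
    """Index of the candidate with the highest ranking score."""
    return max(range(len(candidates)), key=lambda j: rank_score(w, *candidates[j]))


w = {0: 1.0, 100: 0.5, 250: -2.0}  # hypothetical weights
candidates = [(-1.0, {250}), (-1.2, {100}), (-3.0, set())]
best = rerank(w, candidates)
```

Here the first candidate has the highest log probability, but the negative weight on feature 250 pushes it below the second candidate, which is the kind of correction reranking is meant to make.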
4.2 The boosting algorithm
The first algorithm we consider is the boosting algo-
rithm for ranking described in (Collins 2000). The
algorithm is a modification of the method in (Freund
et al. 1998). The method can be considered to be a
greedy algorithm for finding the parameters $\bar{w}$ that
minimize the loss function
$$Loss(\bar{w}) = \sum_i \sum_{j \geq 2} e^{F(x_{i,j}, \bar{w}) - F(x_{i,1}, \bar{w})}$$
where as before, $F(x, \bar{w}) = \bar{w} \cdot \bar{h}(x)$. The theoretical
motivation for this algorithm goes back to the
PAC model of learning. Intuitively, it is useful to
note that this loss function is an upper bound on the
number of “ranking errors”, a ranking error being a
case where an incorrect candidate gets a higher value
for $F$ than a correct candidate. This follows because
for all $x$, $e^{-x} \geq [[x]]$, where we define $[[x]]$ to be $1$
for $x \leq 0$, and $0$ otherwise. Hence
$$Loss(\bar{w}) \geq \sum_i \sum_{j \geq 2} [[M_{i,j}]]$$
where $M_{i,j} = F(x_{i,1}, \bar{w}) - F(x_{i,j}, \bar{w})$. Note that
the number of ranking errors is $\sum_i \sum_{j \geq 2} [[M_{i,j}]]$.
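The relationship between the exponential loss and the count of ranking errors can be checked numerically. In this sketch each training example is a list of (log probability, active feature set) pairs with the best candidate first; that layout, the weights, and the function names are assumptions for illustration.

```python
import math


def score(w, cand):
    """Ranking score of one candidate given as (log_prob, active_feature_ids)."""
    log_prob, active = cand
    return w[0] * log_prob + sum(w.get(s, 0.0) for s in active)


def margins(w, examples):
    """Margin of the first (best) candidate over each competitor, for every example."""
    return [score(w, ex[0]) - score(w, c) for ex in examples for c in ex[1:]]


def exp_loss(w, examples):
    """Sum over all competitor pairs of exp(-margin)."""
    return sum(math.exp(-m) for m in margins(w, examples))


def ranking_errors(w, examples):
    """Number of pairs whose margin is zero or negative."""
    return sum(1 for m in margins(w, examples) if m <= 0)


w = {0: 1.0, 7: 0.5}
examples = [
    [(-1.0, {7}), (-0.8, set()), (-2.0, {7})],
    [(-1.5, set()), (-1.0, {7})],
]
loss, errors = exp_loss(w, examples), ranking_errors(w, examples)
```

Every pair contributes at least 1 to the loss when its margin is not positive, so driving the loss down also limits the number of ranking errors.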
As an initial step, $w_0$ is set to be
$$w_0 = \arg\min_{\alpha} \sum_i \sum_{j \geq 2} e^{\alpha (L(x_{i,j}) - L(x_{i,1}))}$$
and all other parameters $w_s$ for $s = 1 \ldots m$ are set
to be zero. The algorithm then proceeds for $N$ iterations
($N$ is usually chosen by cross validation on a
development set). At each iteration, a single feature
is chosen, and its weight is updated. Suppose the
current parameter values are $\bar{w}$, and a single feature
$k$ is chosen, its weight being updated through an increment
$\delta$, i.e., $w_k = w_k + \delta$. Then the new loss,
after this parameter update, will be
$$L(k, \delta) = \sum_{i, j \geq 2} e^{-M_{i,j} + \delta (h_k(x_{i,j}) - h_k(x_{i,1}))}$$
where $M_{i,j} = F(x_{i,1}, \bar{w}) - F(x_{i,j}, \bar{w})$. The boosting
algorithm chooses the feature/update pair $k^*, \delta^*$
which is optimal in terms of minimizing the loss
function, i.e.,
$$(k^*, \delta^*) = \arg\min_{k, \delta} L(k, \delta) \qquad (1)$$
and then makes the update $w_{k^*} = w_{k^*} + \delta^*$.
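For binary features the minimization over the increment has a closed form, which is worth spelling out because it explains the quantities the boosting procedure tracks. The derivation below is a sketch under the assumption that every feature is binary, so that the difference between a feature's value on the best candidate and on a competitor is -1, 0 or +1. It also assumes both weighted sums are strictly positive; a smoothing term handles the case where one of them is zero.

```latex
% Assumption: binary features, so h_k(x_{i,1}) - h_k(x_{i,j}) \in \{-1, 0, +1\}.
W^+_k = \sum_{(i,j)\,:\,h_k(x_{i,1}) - h_k(x_{i,j}) = 1} e^{-M_{i,j}}, \qquad
W^-_k = \sum_{(i,j)\,:\,h_k(x_{i,1}) - h_k(x_{i,j}) = -1} e^{-M_{i,j}}, \qquad
Z = \sum_{i,\, j \geq 2} e^{-M_{i,j}}

% Loss after adding \delta to w_k: unaffected pairs keep their contribution.
L(k, \delta) = Z - W^+_k - W^-_k + W^+_k e^{-\delta} + W^-_k e^{\delta}

% Setting the derivative with respect to \delta to zero:
-W^+_k e^{-\delta} + W^-_k e^{\delta} = 0
\;\Rightarrow\;
\delta = \tfrac{1}{2} \log \frac{W^+_k}{W^-_k}

% Reduction in loss obtained at that value of \delta:
Z - \min_{\delta} L(k, \delta)
  = W^+_k + W^-_k - 2\sqrt{W^+_k W^-_k}
  = \left( \sqrt{W^+_k} - \sqrt{W^-_k} \right)^2
```

Choosing the feature with the largest absolute difference of square roots therefore picks the feature whose best update reduces the loss the most.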
Figure 2 shows an algorithm which implements
this greedy procedure. See (Collins 2000) for a
full description of the method, including justifica-
tion that the algorithm does in fact implement the
update in Eq. 1 at each iteration.4 The algorithm relies
on the following arrays:
$$A^+_k = \{(i, j) : [h_k(x_{i,1}) - h_k(x_{i,j})] = 1\}$$
$$A^-_k = \{(i, j) : [h_k(x_{i,1}) - h_k(x_{i,j})] = -1\}$$
$$B^+_{i,j} = \{k : [h_k(x_{i,1}) - h_k(x_{i,j})] = 1\}$$
$$B^-_{i,j} = \{k : [h_k(x_{i,1}) - h_k(x_{i,j})] = -1\}$$
Thus $A^+_k$ is an index from features to correct/incorrect
candidate pairs where the $k$’th feature
takes value $1$ on the correct candidate, and value $0$
on the incorrect candidate. The array $A^-_k$ is a similar
index from features to examples. The arrays $B^+_{i,j}$
and $B^-_{i,j}$ are reverse indices from training examples
to features.
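The greedy procedure can be sketched compactly if efficiency is set aside. The version below recomputes the weighted sums from scratch in every round instead of maintaining them incrementally through the index arrays, and it takes the base-model weight as an argument rather than fitting it by a line search. The data layout, the function name, and the default smoothing value are assumptions for illustration.

```python
import math


def boost(examples, w0, rounds, eps=1e-3):
    """Greedy exponential-loss minimization over binary features (naive recomputation).

    examples[i] is a list of (log_prob, feature_set) pairs with the best
    candidate first. Returns a weight dict: key 0 is the base-model weight,
    other keys are feature ids.
    """
    w = {0: w0}
    pairs = []  # per competitor: [margin, features only on best, features only on competitor]
    for cands in examples:
        best_lp, best_f = cands[0]
        for lp, f in cands[1:]:
            pairs.append([w0 * (best_lp - lp), best_f - f, f - best_f])
    for _ in range(rounds):
        z = sum(math.exp(-m) for m, _, _ in pairs)
        w_plus, w_minus = {}, {}
        for m, b_plus, b_minus in pairs:
            for k in b_plus:
                w_plus[k] = w_plus.get(k, 0.0) + math.exp(-m)
            for k in b_minus:
                w_minus[k] = w_minus.get(k, 0.0) + math.exp(-m)
        feats = sorted(set(w_plus) | set(w_minus))
        if not feats:
            break

        def best_loss(k):
            return abs(math.sqrt(w_plus.get(k, 0.0)) - math.sqrt(w_minus.get(k, 0.0)))

        k_star = max(feats, key=best_loss)
        delta = 0.5 * math.log((w_plus.get(k_star, 0.0) + eps * z) /
                               (w_minus.get(k_star, 0.0) + eps * z))
        w[k_star] = w.get(k_star, 0.0) + delta
        for p in pairs:  # update the margins touched by the chosen feature
            if k_star in p[1]:
                p[0] += delta
            elif k_star in p[2]:
                p[0] -= delta
    return w


examples = [
    [(-1.0, {1}), (-0.5, {2})],
    [(-1.2, {1}), (-1.0, {2, 3})],
]
w = boost(examples, w0=1.0, rounds=3)
```

In this toy data the base model prefers the wrong candidate in both examples, and a few rounds are enough for the learned feature weights to reverse that. The incremental bookkeeping exists because recomputing every sum per round, as done here, would be far too slow with many features and pairs.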
4.3 The voted perceptron
Figure 3 shows the training phase of the percep-
tron algorithm, originally introduced in (Rosenblatt
1958). The algorithm maintains a parameter vector
$\bar{w}$, which is initially set to be all zeros. The algorithm
then makes a pass over the training set, at each
training example storing a parameter vector $\bar{w}^i$ for
$i = 1 \ldots n$. The parameter vector is only modified
when a mistake is made on an example. In this case
the update is very simple, involving adding the difference
of the offending examples’ representations
($\bar{w}^i = \bar{w}^{i-1} + \bar{h}(x_{i,1}) - \bar{h}(x_{i,j})$ in the figure). See
(Cristianini and Shawe-Taylor 2000) chapter 2 for
discussion of the perceptron algorithm, and theory
justifying this method for setting the parameters.
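The training pass can be sketched with sparse vectors stored as dictionaries. One detail is an assumption of this sketch rather than something stated in the text: ties in the argmax are broken toward later candidates, so that the all-zero starting vector counts as a mistake and triggers an update. Key 0 is used here for the base-model log probability.

```python
def dot(w, h):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w.get(k, 0.0) * v for k, v in h.items())


def train_perceptron(examples):
    """One pass of perceptron training for ranking.

    examples[i] is a list of sparse feature vectors with the best candidate
    first. Returns the list of stored parameter vectors, one per example.
    """
    w, stored = {}, []
    for cands in examples:
        # Ties go to the highest index, so the best candidate must win strictly.
        j = max(range(len(cands)), key=lambda c: (dot(w, cands[c]), c))
        if j != 0:
            w = dict(w)  # copy, so previously stored vectors stay unchanged
            for k, v in cands[0].items():
                w[k] = w.get(k, 0.0) + v
            for k, v in cands[j].items():
                w[k] = w.get(k, 0.0) - v
        stored.append(w)
    return stored


examples = [
    [{0: -1.0, 1: 1.0}, {0: -0.5, 2: 1.0}],
    [{0: -1.2, 1: 1.0}, {0: -1.0, 2: 1.0, 3: 1.0}],
]
stored = train_perceptron(examples)
```

When no mistake is made the same dictionary object is stored again, which makes unchanged consecutive vectors cheap to detect later.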
In the most basic form of the perceptron, the parameter
values $\bar{w}^n$ are taken as the final parameter
settings, and the output on a new test example
with $x_j$ for $j = 1 \ldots m$ is simply the highest
4Strictly speaking, this is only the case if the smoothing
parameter $\epsilon$ is $0$.
Input
• Examples $x_{i,j}$ with initial scores $L(x_{i,j})$
• Arrays $A^+_k$, $A^-_k$, $B^+_{i,j}$ and $B^-_{i,j}$ as described in
section 4.2.
• Parameters are number of rounds of boosting
$N$, a smoothing parameter $\epsilon$.
Initialize
• Set $w_0 = \arg\min_{\alpha} \sum_{i,j} e^{\alpha (L(x_{i,j}) - L(x_{i,1}))}$
• Set $\bar{w} = \{w_0, 0, 0, \ldots\}$
• For all $i, j$, set $M_{i,j} = w_0 [L(x_{i,1}) - L(x_{i,j})]$.
• Set $Z = \sum_i \sum_{j \geq 2} e^{-M_{i,j}}$
• For $k = 1 \ldots m$, calculate
– $W^+_k = \sum_{(i,j) \in A^+_k} e^{-M_{i,j}}$
– $W^-_k = \sum_{(i,j) \in A^-_k} e^{-M_{i,j}}$
– $BestLoss(k) = \left| \sqrt{W^+_k} - \sqrt{W^-_k} \right|$
Repeat for $t$ = 1 to $N$
• Choose $k^* = \arg\max_k BestLoss(k)$
• Set $\delta^* = \frac{1}{2} \log \frac{W^+_{k^*} + \epsilon Z}{W^-_{k^*} + \epsilon Z}$
• Update one parameter, $w_{k^*} = w_{k^*} + \delta^*$
• for $(i, j) \in A^+_{k^*}$
– $\Delta = e^{-M_{i,j} - \delta^*} - e^{-M_{i,j}}$
– $M_{i,j} = M_{i,j} + \delta^*$
– for $k \in B^+_{i,j}$, $W^+_k = W^+_k + \Delta$
– for $k \in B^-_{i,j}$, $W^-_k = W^-_k + \Delta$
– $Z = Z + \Delta$
• for $(i, j) \in A^-_{k^*}$
– $\Delta = e^{-M_{i,j} + \delta^*} - e^{-M_{i,j}}$
– $M_{i,j} = M_{i,j} - \delta^*$
– for $k \in B^+_{i,j}$, $W^+_k = W^+_k + \Delta$
– for $k \in B^-_{i,j}$, $W^-_k = W^-_k + \Delta$
– $Z = Z + \Delta$
• For all features $k$ whose values of $W^+_k$
and/or $W^-_k$ have changed, recalculate
$BestLoss(k) = \left| \sqrt{W^+_k} - \sqrt{W^-_k} \right|$
Output Final parameter setting $\bar{w}$
Figure 2: The boosting algorithm.
Define: $F(x, \bar{w}) = \bar{w} \cdot \bar{h}(x)$.
Input: Examples $x_{i,j}$ with feature vectors $\bar{h}(x_{i,j})$.
Initialization: Set parameters $\bar{w}^0 = 0$
For $i = 1 \ldots n$
    $j = \arg\max_{j = 1 \ldots n_i} F(x_{i,j}, \bar{w}^{i-1})$
    If $(j = 1)$ Then $\bar{w}^i = \bar{w}^{i-1}$
    Else $\bar{w}^i = \bar{w}^{i-1} + \bar{h}(x_{i,1}) - \bar{h}(x_{i,j})$
Output: Parameter vectors $\bar{w}^i$ for $i = 1 \ldots n$
Figure 3: The perceptron training algorithm for
ranking problems.
Define: $F(x, \bar{w}) = \bar{w} \cdot \bar{h}(x)$.
Input: A set of candidates $x_j$ for $j = 1 \ldots m$,
A sequence of parameter vectors $\bar{w}^i$ for $i = 1 \ldots n$
Initialization: Set $V[j] = 0$ for $j = 1 \ldots m$
($V[j]$ stores the number of votes for $x_j$)
For $i = 1 \ldots n$
    $j = \arg\max_{k = 1 \ldots m} F(x_k, \bar{w}^i)$
    $V[j] = V[j] + 1$
Output: $x_j$ where $j = \arg\max_k V[k]$
Figure 4: Applying the voted perceptron to a test
example.
scoring candidate under these parameter values, i.e.,
$x_k$ where $k = \arg\max_j \bar{\alpha}^n \cdot \mathbf{h}(x_j)$.
(Freund & Schapire 1999) describe a refinement
of the perceptron, the voted perceptron. The training
phase is identical to that in figure 3. Note, however,
that all parameter vectors $\bar{\alpha}^i$ for $i = 1 \ldots n$
are stored. Thus the training phase can be thought
of as a way of constructing $n$ different parameter
settings. Each of these parameter settings will
have its own highest ranking candidate, $x_k$ where
$k = \arg\max_j \bar{\alpha}^i \cdot \mathbf{h}(x_j)$. The idea behind the voted
perceptron is to let each of the $n$ parameter settings
“vote” for a candidate, and the candidate
which gets the most votes is returned as the most
likely candidate. See figure 4 for the algorithm.⁵
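The voting procedure of Figure 4 can be sketched in Python as below. The function name, the dense-vector representation, and the toy data are assumptions made for illustration. The sketch also includes one simple form of book-keeping: when a parameter vector equals the previous one, the previous winner is reused rather than rescored. Ties are resolved in favour of the lowest index, which the original does not specify.

```python
from collections import Counter

def voted_decode(candidates, alphas):
    """Return (index of the winning candidate, vote counts).

    candidates: list of feature vectors h(x_j) for one test example.
    alphas: the stored parameter vectors alpha^1 ... alpha^n.
    """
    votes = Counter()
    prev_alpha, prev_winner = None, None
    for alpha in alphas:
        if alpha != prev_alpha:
            # Rescore only when the parameters actually changed.
            scores = [sum(a * v for a, v in zip(alpha, h)) for h in candidates]
            prev_winner = max(range(len(candidates)), key=scores.__getitem__)
            prev_alpha = alpha
        votes[prev_winner] += 1
    winner = max(range(len(candidates)), key=lambda j: votes[j])
    return winner, votes

# Toy data (hypothetical): three candidates, four stored parameter vectors.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [2.0, -0.5]]
winner, votes = voted_decode(candidates, alphas)
```

Decoding costs one scoring pass per distinct stored vector, which is the extra test-time computation the voted perceptron trades for cheaper training.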
5 Experiments
We applied the voted perceptron and boosting algo-
rithms to the data described in section 2.3. Only fea-
tures occurring on 5 or more distinct training sen-
tences were included in the model. This resulted
⁵ Note that, for reasons of explication, the decoding algorithm
we present is less efficient than necessary. For example,
when $\bar{\alpha}^i = \bar{\alpha}^{i-1}$ it is preferable to use some book-keeping to
avoid recalculation of $F(x_j, \bar{\alpha}^i)$ and $\arg\max_j F(x_j, \bar{\alpha}^i)$.
                    P             R             F
Max-Ent             84.4          86.3          85.3
Boosting            87.3 (18.6)   87.9 (11.6)   87.6 (15.6)
Voted Perceptron    87.3 (18.6)   88.6 (16.8)   87.9 (17.7)
Figure 5: Results for the three tagging methods.
P = precision, R = recall, F = F-measure. Figures
in parentheses are relative improvements in error
rate over the maximum-entropy model. All figures
are percentages.
in 93,777 distinct features. The two methods were
trained on the training portion (41,992 sentences) of
the training set. We used the development set to pick
the best values for tunable parameters in each algorithm.
For boosting, the main parameter to pick is
the number of rounds, $N$. We ran the algorithm for
a total of 300,000 rounds, and found that the optimal
value for F-measure on the development set
occurred after 83,233 rounds. For the voted perceptron,
the representation $\mathbf{h}(x)$ was taken to be a
vector $\{\beta L(x), h_1(x), \ldots, h_m(x)\}$ where $\beta$ is a
parameter that influences the relative contribution of
the log-likelihood term versus the other features. The
value of $\beta$ was set to the one that gave the best
results on the development set. Figure 5 shows the
results for the three methods on the test set. Both of
the reranking algorithms show significant improvements
over the baseline: a 15.6% relative reduction
in error for boosting, and a 17.7% relative error
reduction for the voted perceptron.
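As a check on how the relative error reductions are obtained, the short sketch below recomputes them from the F-measure column of Figure 5, treating $100 - F$ as the error rate. The function name is hypothetical.

```python
def relative_error_reduction(baseline, system):
    """Percentage of the baseline error (100 - score) removed by the system."""
    baseline_error = 100.0 - baseline
    system_error = 100.0 - system
    return 100.0 * (baseline_error - system_error) / baseline_error

# F-measure values taken from Figure 5.
boosting_gain = round(relative_error_reduction(85.3, 87.6), 1)
voted_gain = round(relative_error_reduction(85.3, 87.9), 1)
```

Because the table entries are themselves rounded, recomputing from them can differ from the reported figure in the last digit; the boosting recall entry, for instance, comes out as 11.7 rather than 11.6.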
In our experiments we found the voted percep-
tron algorithm to be considerably more efficient in
training, at some cost in computation on test exam-
ples. Another attractive property of the voted per-
ceptron is that it can be used with kernels, for exam-
ple the kernels over parse trees described in (Collins
and Duffy 2001; Collins and Duffy 2002). (Collins
and Duffy 2002) describe the voted perceptron ap-
plied to the named-entity data in this paper, but us-
ing kernel-based features rather than the explicit fea-
tures described in this paper. See (Collins 2002) for
additional work using perceptron algorithms to train
tagging models, and a more thorough description of
the theory underlying the perceptron algorithm ap-
plied to ranking problems.
6 Discussion
A question regarding the approaches in this paper
is whether the features we have described could be
incorporated in a maximum-entropy tagger, giving
similar improvements in accuracy. This section dis-
cusses why this is unlikely to be the case. The prob-
lem described here is closely related to the label bias
problem described in (Lafferty et al. 2001).
One straightforward way to incorporate global
features into the maximum-entropy model would be
to introduce new features $f(h, t)$ which indicate
whether the tagging decision $t$ in the history $h$ creates
a particular global feature. For example, we
could introduce a feature
$$f(h, t) = \begin{cases} 1 & \text{if } t = \mathrm{N} \text{ and this decision creates an LWLC=1 feature} \\ 0 & \text{otherwise} \end{cases}$$
As an example, this would take the value 1 if “its” was
tagged as N in the following context,
She/N praised/N the/N University/S for/C its/? efforts to ...
because tagging “its” as N in this context would create
an entity whose last word was not capitalized, i.e.,
“University for”. Similar features could be created for
all of the global features introduced in this paper.
This example also illustrates why this approach
is unlikely to improve the performance of the
maximum-entropy tagger. The parameter associated
with this new feature can only affect the
score for a proposed sequence by modifying $P(t \mid h)$
at the point at which $f(h, t) = 1$. In the example,
this means that the LWLC=1 feature can only
lower the score for the segmentation by lowering the
probability of tagging “its” as N. But “its” has a
probability of almost 1 of not appearing as part of an
entity, so $P(\mathrm{N} \mid h)$ should be almost 1 whether $f$ is 1 or 0
in this context! The decision which effectively created
the entity “University for” was the decision to tag
“for” as C, and this has already been made. The
independence assumptions in maximum-entropy taggers
of this form often lead points of local ambiguity (in
this example the tag for the word “for”) to create
globally implausible structures with unreasonably high
scores. See (Collins 1999) section 8.4.2 for a
discussion of this problem in the context of parsing.
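The argument can be illustrated numerically with a locally normalized model of the form $P(t \mid h) \propto \exp(\mathrm{score}(t, h))$. The scores and the feature weight below are invented purely for illustration and are not taken from the experiments: they describe a context where N is already overwhelmingly likely for “its”, and a new feature with a fairly large negative weight fires when N is chosen.

```python
import math

def tag_distribution(scores):
    """Locally normalized distribution over tags: P(t | h) = exp(s_t) / sum exp(s)."""
    z = sum(math.exp(s) for s in scores.values())
    return {tag: math.exp(s) / z for tag, s in scores.items()}

# Hypothetical local scores for the word "its"; N dominates.
base_scores = {"N": 8.0, "S": 0.0, "C": 0.0}

# The global-feature indicator fires on the N decision with hypothetical weight -2.
penalised_scores = dict(base_scores, N=base_scores["N"] - 2.0)

p_without = tag_distribution(base_scores)["N"]
p_with = tag_distribution(penalised_scores)["N"]

# Change in the log-probability of the whole tag sequence due to the feature.
log_score_change = math.log(p_with) - math.log(p_without)
```

Although the weight is $-2$, the log score of the sequence moves by far less than that, because the local normalization absorbs almost all of the penalty. The implausible entity created by the earlier decision on “for” is therefore barely penalized.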
Acknowledgements Many thanks to Jack Minisi for
annotating the named-entity data used in the exper-
iments. Thanks also to Nigel Duffy, Rob Schapire
and Yoram Singer for several useful discussions.
References
Abney, S. (1997). Stochastic Attribute-Value Grammars.
Computational Linguistics, 23(4):597-618.
Bikel, D., Schwartz, R., and Weischedel, R. (1999). An Algo-
rithm that Learns What’s in a Name. In Machine Learning:
Special Issue on Natural Language Learning, 34(1-3).
Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R.
(1998). Exploiting Diverse Knowledge Sources via Maxi-
mum Entropy in Named Entity Recognition. Proc. of the
Sixth Workshop on Very Large Corpora.
Collins, M. (1999). Head-Driven Statistical Models for Natural
Language Parsing. PhD Thesis, University of Pennsylvania.
Collins, M. (2000). Discriminative Reranking for Natural Lan-
guage Parsing. Proceedings of the Seventeenth International
Conference on Machine Learning (ICML 2000).
Collins, M., and Duffy, N. (2001). Convolution Kernels for Nat-
ural Language. In Proceedings of NIPS 14.
Collins, M., and Duffy, N. (2002). New Ranking Algorithms for
Parsing and Tagging: Kernels over Discrete Structures, and
the Voted Perceptron. In Proceedings of ACL 2002.
Collins, M. (2002). Discriminative Training Methods for Hid-
den Markov Models: Theory and Experiments with the Per-
ceptron Algorithm. In Proceedings of EMNLP 2002.
Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to
Support Vector Machines and other kernel-based learning
methods. Cambridge University Press.
Della Pietra, S., Della Pietra, V., and Lafferty, J. (1997). Induc-
ing Features of Random Fields. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 19(4), pp. 380-393.
Freund, Y. & Schapire, R. (1999). Large Margin Classifica-
tion using the Perceptron Algorithm. In Machine Learning,
37(3):277–296.
Freund, Y., Iyer, R., Schapire, R.E., & Singer, Y. (1998). An
efficient boosting algorithm for combining preferences. In
Machine Learning: Proceedings of the Fifteenth International
Conference.
Johnson, M., Geman, S., Canon, S., Chi, Z. and Riezler, S.
(1999). Estimators for Stochastic “Unification-based” Gram-
mars. Proceedings of the ACL 1999.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional
random fields: Probabilistic models for segmenting and la-
beling sequence data. In Proceedings of ICML 2001.
McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum
entropy markov models for information extraction and seg-
mentation. In Proceedings of ICML 2000.
Ratnaparkhi, A. (1996). A maximum entropy part-of-speech
tagger. In Proceedings of the empirical methods in natural
language processing conference.
Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model
for Information Storage and Organization in the Brain.
Psychological Review, 65, 386–408. (Reprinted in
Neurocomputing (MIT Press, 1988).)
Walker, M., Rambow, O., and Rogati, M. (2001). SPoT: a train-
able sentence planner. In Proceedings of the 2nd Meeting of
the North American Chapter of the Association for Compu-
tational Linguistics (NAACL 2001).
