Learning a Perceptron-Based Named Entity Chunker
via Online Recognition Feedback
Xavier Carreras and Llu´ıs M`arquez and Llu´ıs Padr´o
TALP Research Center
Departament de Llenguatges i Sistemes Inform`atics
Universitat Polit`ecnica de Catalunya
{carreras,lluism,padro}@lsi.upc.es
1 Introduction
We present a novel approach for the problem of Named
Entity Recognition and Classification (NERC), in the
context of the CoNLL-2003 Shared Task.
Our work is framed into the learning and inference
paradigm for recognizing structures in Natural Language
(Punyakanok and Roth, 2001; Carreras et al., 2002). We
make use of several learned functions which, applied
at local contexts, discriminatively select optimal partial
structures. On the top of this local recognition, an infer-
ence layer explores the partial structures and builds the
optimal global structure for the problem.
For the NERC problem, the structures to be recognized
are the named entity phrases (NE) of a sentence. First, we
apply learning at word level to identify NE candidates by
means of a Begin-Inside classification. Then, we make
use of functions learned at phrase level —one for each
NE category— to discriminate among competing NEs.
We propose a simple online learning algorithm for
training all the involved functions together. Each function
is modeled as a voted perceptron (Freund and Schapire,
1999). The learning strategy works online at sentence
level. When visiting a sentence, the functions being
learned are first used to recognize the NE phrases, and
then updated according to the correctness of their solu-
tion. We analyze the dependencies among the involved
perceptrons and a global solution in order to design a
global update rule based on the recognition of named-
entities, which reflects to each individual perceptron its
committed errors from a global perspective.
The learning approach presented here is closely re-
lated to –and inspired by– some recent works in the area
of NLP and Machine Learning. Collins (2002) adapted
the perceptron learning algorithm to tagging tasks, via
sentence-based global feedback. Crammer and Singer
(2003) presented an online topic-ranking algorithm in-
volving several perceptrons and ranking-based update
rules for training them.
2 Named-Entity Phrase Chunking
In this section we describe our NERC approach as a
phrase chunking problem. First we formalize the prob-
lem of NERC, then we propose a NE-Chunker.
2.1 Problem Formalization
Let x be a sentence belonging to the sentence space X,
formed by n words xi with i ranging from 0 to n−1. Let
K be the set of NE categories, which in the CoNLL-2003
setting is K = {LOC, PER, ORG, MISC}.
A NE phrase, denoted as (s,e)k, is a phrase spanning
from word xs to word xe, having s ≤ e, with category
k ∈ K. Let NE be the set of all potential NE phrases,
expressed as NE = {(s,e)k | 0 ≤ s ≤ e,k ∈ K} .
We say that two different NE phrases ne1 = (s1,e1)k1
and ne2 = (s2,e2)k2 overlap, denoted as ne1∼ne2 iff
e1 ≥ s2 ∧ e2 ≥ s1. A solution for the NERC problem
is a set y formed by NE phrases that do not overlap, also
known as a chunking. We define the set Y as the set of all
possible chunkings. Formally, it can be expressed as:
Y = {y ⊆ NE |∀ne1,ne2 ∈ y ne1negationslash∼ne2}
The goal of the NE extraction problem is to identify
the correct solution y ∈ Y for a given sentence x.
2.2 NE-Chunker
The NE-Chunker is a function which given a sentence
x ∈ X identifies the set of NE phrases y ∈ Y:
NEch : X → Y
The NE-Chunker recognizes NE phrases in two lay-
ers of processing. In the first layer, a set of NE can-
didates for a sentence is identified, out of all the po-
tential phrases in NE. To do so, we apply learning at
word level in order to perform a Begin-Inside classifica-
tion. That is, we assume a function hB(w) which de-
cides whether a word w begins a NE phrase or not, and a
function hI(w) which decides whether a word is inside a
NE phrase or not. Furthermore, we define the predicate
BI∗, which tests whether a certain phrase is formed by
a starting begin word and subsequent inside words. For-
mally, BI∗((s,e)k) = (hB(s) ∧ ∀i : s < i ≤ e : hI(i)).
The recognition will only consider solutions formed by
phases in NE which satisfy the BI∗ predicate. Thus, this
layer is used to filter out candidates from NE and conse-
quently reduce the size of the solution space Y. Formally,
the solution space that is explored can be expressed as
YBI∗ = {y ∈ Y |∀ne∈y BI∗(ne)}.
The second layer selects the best coherent set of NE
phrases by applying learning at phrase level. We assume
a number of scoring functions, which given a NE phrase
produce a real-valued score indicating the plausibility of
the phrase. In particular, for each category k ∈ K we as-
sume a function scorek which produces a positive score if
the phrase is likely to belong to category k, and a negative
score otherwise.
Given this, the NE-Chunker is a function which
searches a NE chunking for a sentence x according to
the following optimality criterion:
NEch(x) = arg max
y∈YBI∗
summationdisplay
(s,e)k∈y
scorek(s,e)
That is, among the considered chunkings of the sen-
tence, the optimal one is defined to be the one whose
NE phrases maximize the summation of phrase scores.
Practically, there is no need to explicitly enumerate each
possible chunking in YBI∗. Instead, by using dynamic
programming the optimal chunking can be found in
quadratic time over the sentence length, performing a
Viterby-style exploration from left to right (Punyakanok
and Roth, 2001).
Summarizing, the NE-Chunker recognizes the set of
NE phrases of a sentence as follows: First, NE candidates
are identified in linear time, applying a linear number of
decisions. Then, the optimal coherent set of NE phrases
is selected in quadratic time, applying a quadratic number
of decisions.
3 Learning via Recognition Feedback
We now present an online learning strategy for training
the learning components of the NE-Chunker, namely the
functions hB and hI and the functions scorek, for k ∈ K.
Each function is implemented using a perceptron1 and
a representation function.
A perceptron is a linear discriminant function h¯w :
Rn → R parametrized by a weight vector ¯w in Rn.
Given an instance ¯x ∈ Rn, a perceptron outputs as
prediction the inner product between vectors ¯x and ¯w,
h¯w(x) = ¯w · ¯x.
1Actually, we use a variant of the model called the voted
perceptron, explained below.
The representation function Φ : X → Rn codifies an
instance x belonging to some space X into a vector inRn
with which the perceptron can operate.
The functions hB and hI predict whether a word begins
or is inside a NE phrase, respectively. Each one consists
of a perceptron weight vector, ¯wB and ¯wI, and a shared
representation function Φw, explained in section 4. Each
function is computed as hl = ¯wl ·Φw(x), for l ∈ {B,I},
and the sign is taken as the binary classification.
The functions scorek, for k ∈ K, compute a score for
a phrase (s,e) being a NE phrase of category k. For each
function there is a vector ¯wk, and a shared representation
function Φp, also explained in section 4. The score is
given by the expression scorek(s,e) = ¯wk ·Φp(s,e).
3.1 Learning Algorithm
We propose a mistake-driven online learning algorithm
for training the parameter vectors ¯w of each perceptron all
in one go. The algorithm starts with all vectors initialized
to ¯0, and then runs repeatedly in a number of epochs T
through all the sentences in the training set. Given a sen-
tence, it predicts its optimal chunking as specified above
using the current vectors. If the predicted chunking is not
perfect the vectors which are responsible of the incorrect
predictions are updated additively.
The sentence-based learning algorithm is as follows:
• Input: {(x1,y1),...,(xm,ym)}.
• Define: W = {¯wB, ¯wI}∪{¯wk|k ∈ K}.
• Initialize: ∀¯w ∈ W ¯w = ¯0;
• for t = 1...T , for i = 1...m :
1. ˆy = NEchW(xi)
2. learning feedback(W,xi,yi, ˆy)
• Output: the vectors in W.
We now describe the learning feedback. Let y∗ be the
gold set of NE phrases for a sentence x, and ˆy the set pre-
dicted by the NE-Chunker. Let goldB(i) and goldI(i) be
respectively the perfect indicator functions for the begin
and inside classifications, that is, they return 1 if word
xi begins or is inside some phrase in y∗ and 0 otherwise.
We differentiate three kinds of phrases in order to give
feedback to the functions being learned:
• Phrases correctly identified: ∀(s,e)k ∈ y∗ ∩ ˆy:
– Do nothing, since they are correct.
• Missed phrases: ∀(s,e)k ∈ y∗ \ ˆy:
1. Update begin word, if misclassified:
if ( ¯wB ·Φw(xs) ≤ 0) then
¯wB = ¯wB + Φw(xs)
2. Update misclassified inside words:
∀i : s < i ≤ e : such that ( ¯wI ·Φw(xi) ≤ 0)
¯wI = ¯wI + Φw(xi)
3. Update score function, if it has been applied:
if ( ¯wB ·Φw(xs) > 0 ∧
∀i : s < i ≤ e : ¯wI ·Φw(xi) > 0) then
¯wk = ¯wk + Φp(s,e)
• Over-predicted phrases: ∀(s,e)k ∈ ˆy \y∗:
1. Update score function:
¯wk = ¯wk −Φp(s,e)
2. Update begin word, if misclassified :
if (goldB(s) = 0) then
¯wB = ¯wB −Φw(xs)
3. Update misclassified inside words :
∀i : s < i ≤ e : such that (goldI(i) = 0)
¯wI = ¯wI −Φw(xi)
This feedback models the interaction between the two
layers of the recognition process. The Begin-Inside iden-
tification filters out phrase candidates for the scoring
layer. Thus, misclassifying words of a correct phrase
blocks the generation of the candidate and produces a
missed phrase. Therefore, we move the begin or end
prediction vectors toward the misclassified words of a
missed phrase. When an incorrect phrase is predicted,
we move away the prediction vectors of the begin and in-
side words, provided that they are not in the beginning or
inside a phrase in the gold chunking. Note that we delib-
erately do not care about false positives begin or inside
words which do not finally over-produce a phrase.
Regarding the scoring layer, each category prediction
vector is moved toward missed phrases and moved away
from over-predicted phrases.
3.2 Voted Perceptron and Kernelization
Although the analysis above concerns the perceptron al-
gorithm, we use a modified version, the voted perceptron
algorithm, introduced in (Freund and Schapire, 1999).
The key point of the voted version is that, while train-
ing, it stores information in order to make better predic-
tions on test data. Specifically, all the prediction vec-
tors ¯wj generated after every mistake are stored, together
with a weight cj, which corresponds to the number of
decisions the vector ¯wj survives until the next mistake.
Let J be the number of vector that a perceptron accumu-
lates. The final hypothesis is an averaged vote over the
predictions of each vector, computed with the expression
h¯w(¯x) =summationtextJj=1 cj( ¯wj · ¯x) .
Moreover, we work with the dual formulation of the
vectors, which allows the use of kernel functions. It is
shown in (Freund and Schapire, 1999) that a vector w can
be expressed as the sum of instances xj that were added
(sxj = +1) or subtracted (sxj = −1) in order to create it,
as w = summationtextJj=1 sxjxj. Given a kernel function K(x,xprime),
the final expression of a dual voted perceptron becomes:
h¯w(¯x) =
Jsummationdisplay
j=1
cj
jsummationdisplay
l=1
sxlK(¯xl, ¯x)
In this paper we work with polynomial kernels
K(x,xprime) = (x · xprime + 1)d, where d is the degree of the
kernel.
4 Feature-Vector Representation
In this section we describe the representation functions
Φw and Φp, which respectively map a word or a phrase
and their local context into a feature vector in Rn, partic-
ularly, {0,1}n. First, we define a set of predicates which
are computed on words and return one or more values:
• Form(w), PoS(w): The form and PoS of word w.
• Orthographic(w): Binary flags of word w with re-
gard to how is it capitalized (initial-caps, all-caps),
the kind of characters that form the word (contains-
digits, all-digits, alphanumeric, Roman-number),
the presence of punctuation marks (contains-
dots, contains-hyphen, acronym), single character
patterns (lonely-initial, punctuation-mark, single-
char), or the membership of the word to a predefined
class (functional-word2), or pattern (URL).
• Affixes(w): The prefixes and suffixes of the word w
(up to 4 characters).
• Word Type Patterns(ws ...we): Type pattern of
consecutive words ws ...we. The type of a word
is either functional (f), capitalized (C), lowercased
(l), punctuation mark (.), quote (’) or other (x).
For instance, the word type pattern for the phrase
“John Smith payed 3 euros” would be CClxl.
For the function Φw(xi) we compute the predicates in
a window of words around xi, that is, words xi+l with
l ∈ [−Lw,+Lw]. Each predicate label, together with
each relative position l and each returned value forms a
final binary indicator feature. The word type patterns are
evaluated in all sequences within the window which in-
clude the central word i.
For the function Φp(s,e) we represent the context of
the phrase by evaluating a [−Lp,0] window of predicates
at the s word and a separate [0,+Lp] window at the e
word. At the s window, we also codify the named enti-
ties already recognized at the left context, capturing their
category and relative position. Furthermore, we represent
the (s,e) phrase by evaluating the predicates without cap-
turing the relative position in the features. In particular,
2Functional words are determiners and prepositions which
typically appear inside NEs.
for the words within (s,e) we evaluate the form, affixes
and type patterns of sizes 2, 3 and 4. We also evaluate the
complete concatenated form of the phrase and the word
type pattern spanning the whole phrase. Finally, we make
use of a gazetteer to capture possible NE categories of the
whole NE form and each single word within it.
5 Experiments and Results
A list of functional words was automatically extracted
from each language training set, selecting those lower-
cased words within NEs appearing 3 times or more. For
each language, we also constructed a gazetteer with the
NEs in the training set. When training, only a random
40% of the entries was considered.
We performed parameter tuning on the English lan-
guage. Concerning the features, we set the window sizes
(Lw and Lp) to 3 (we tested 2 and 3) , and we did not con-
sidered features occurring less than 5 times in the data.
When moving to German, we found better to work with
lemmas instead of word forms.
Concerning the learning algorithm, we evaluated ker-
nel degrees from 1 to 5. Degrees 2 and 3 performed some-
what better than others, and we chose degree 2. We then
ran the algorithm through the English training set for up
to five epochs, and through the German training set for up
to 3 epochs. 3 On both languages, the performance was
still slightly increasing while visiting more training sen-
tences. Unfortunately, we were not able to run the algo-
rithm until performance was stable. Table 1 summarizes
the obtained results on all sets. Clearly, the NERC task
on English is much easier than on German. Figures indi-
cate that the moderate performance on German is mainly
caused by the low recall, specially for ORG and MISC en-
tities. It is interesting to note that while in English the
performance is much better on the development set, in
German we achieve better results on the test set. This
seems to indicate that the difference in performance be-
tween development and test sets is due to irregularities
in the NEs that appear in each set, rather than overfitting
problems of our learning strategy.
The general performance of phrase recognition system
we present is fairly good, and we think it is competitive
with state-of-the-art named entity extraction systems.
Acknowledgments
This research has been partially funded by the European
Commission (Meaning, IST-2001-34460) and the Span-
ish Research Dept. (Hermes, TIC2000-0335-C03-02; Pe-
tra - TIC2000-1735-C02-02). Xavier Carreras holds a
grant by the Catalan Government Research Department.
3Implemented in PERL and run on a Pentium IV (Linux,
2.5GHz, 512Mb) it took about 120 hours for English and 70
hours for German.
English devel. Precision Recall Fβ=1
LOC 90.77% 93.63% 92.18
MISC 91.98% 80.80% 86.03
ORG 86.02% 83.52% 84.75
PER 91.37% 90.77% 91.07
Overall 90.06% 88.47% 89.26
English test Precision Recall Fβ=1
LOC 86.66% 89.15% 87.88
MISC 84.90% 72.08% 77.97
ORG 82.73% 77.60% 80.09
PER 88.25% 86.39% 87.31
Overall 85.81% 82.84% 84.30
German devel. Precision Recall Fβ=1
LOC 75.21% 67.32% 71.05
MISC 76.90% 42.18% 54.48
ORG 76.80% 47.22% 58.48
PER 76.87% 60.96% 67.99
Overall 76.36% 55.06% 63.98
German test Precision Recall Fβ=1
LOC 72.89% 65.22% 68.84
MISC 67.14% 42.09% 51.74
ORG 77.67% 42.30% 54.77
PER 87.23% 70.88% 78.21
Overall 77.83% 58.02% 66.48
Table 1: Results obtained for the development and the
test data sets for the English and German languages.

References
X. Carreras, L. M`arquez, V. Punyakanok, and D. Roth.
2002. Learning and Inference for Clause Identifica-
tion. In Proceedings of the 14th European Conference
on Machine Learning, ECML, Helsinki, Finland.
M. Collins. 2002. Discriminative Training Meth-
ods for Hidden Markov Models: Theory and Experi-
ments Perceptron Algorithms. In Proceedings of the
EMNLP’02.
K. Crammer and Y. Singer. 2003. A Family of Additive
Online Algorithms for Category Ranking. Journal of
Machine Learning Research, 3:1025–1058.
Y. Freund and R. E. Schapire. 1999. Large Margin Clas-
sification Using the Perceptron Algorithm. Machine
Learning, 37(3):277–296.
V. Punyakanok and D. Roth. 2001. The Use of Clas-
sifiers in Sequential Inference. In Proceedings of the
NIPS-13.
