Early Results for
Named Entity Recognition with Conditional Random Fields,
Feature Induction and Web-Enhanced Lexicons
Andrew McCallum and Wei Li
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003
{mccallum,weili}@cs.umass.edu
1 Introduction
Models for many natural language tasks benefit from the
flexibility to use overlapping, non-independent features.
For example, the need for labeled data can be drastically
reduced by taking advantage of domain knowledge in
the form of word lists, part-of-speech tags, character n-
grams, and capitalization patterns. While it is difficult to
capture such inter-dependent features with a generative
probabilistic model, conditionally-trained models, such
as conditional maximum entropy models, handle them
well. There has been significant work with such mod-
els for greedy sequence modeling in NLP (Ratnaparkhi,
1996; Borthwick et al., 1998).
Conditional Random Fields (CRFs) (Lafferty et al.,
2001) are undirected graphical models, a special case of
which correspond to conditionally-trained finite state ma-
chines. While based on the same exponential form as
maximum entropy models, they have efficient procedures
for complete, non-greedy finite-state inference and train-
ing. CRFs have shown empirical successes recently in
POS tagging (Lafferty et al., 2001), noun phrase segmen-
tation (Sha and Pereira, 2003) and Chinese word segmen-
tation (McCallum and Feng, 2003).
Given these models’ great flexibility to include a wide
array of features, an important question that remains is
what features should be used? For example, in some
cases capturing a word tri-gram is important, however,
there is not sufficient memory or computation to include
all word tri-grams. As the number of overlapping atomic
features increases, the difficulty and importance of con-
structing only certain feature combinations grows.
This paper presents a feature induction method for
CRFs. Founded on the principle of constructing only
those feature conjunctions that significantly increase log-
likelihood, the approach builds on that of Della Pietra et
al (1997), but is altered to work with conditional rather
than joint probabilities, and with a mean-field approxi-
mation and other additional modifications that improve
efficiency specifically for a sequence model. In compari-
son with traditional approaches, automated feature induc-
tion offers both improved accuracy and significant reduc-
tion in feature count; it enables the use of richer, higher-
order Markov models, and offers more freedom to liber-
ally guess about which atomic features may be relevant
to a task.
Feature induction methods still require the user to cre-
ate the building-block atomic features. Lexicon member-
ship tests are particularly powerful features in natural lan-
guage tasks. The question is where to get lexicons that are
relevant for the particular task at hand?
This paper describes WebListing, a method that obtains
seeds for the lexicons from the labeled data, then uses the
Web, HTML formatting regularities and a search engine
service to significantly augment those lexicons. For ex-
ample, based on the appearance of Arnold Palmer in the
labeled data, we gather from the Web a large list of other
golf players, including Tiger Woods (a phrase that is dif-
ficult to detect as a name without a good lexicon).
We present results on the CoNLL-2003 named entity
recognition (NER) shared task, consisting of news arti-
cles with tagged entities PERSON, LOCATION, ORGANI-
ZATION and MISC. The data is quite complex; for exam-
ple the English data includes foreign person names (such
as Yayuk Basuki and Innocent Butare), a wide diversity of
locations (including sports venues such as The Oval, and
rare location names such as Nirmal Hriday), many types
of organizations (from company names such as 3M, to
acronyms for political parties such as KDP, to location
names used to refer to sports teams such as Cleveland),
and a wide variety of miscellaneous named entities (from
software such as Java, to nationalities such as Basque, to
sporting competitions such as 1,000 Lakes Rally).
On this, our first attempt at a NER task, with just a few
person-weeks of effort and little work on development-
set error analysis, our method currently obtains overall
English F1 of 84.04% on the test set by using CRFs, fea-
ture induction and Web-augmented lexicons. German F1
using very limited lexicons is 68.11%.
2 Conditional Random Fields
Conditional Random Fields (CRFs) (Lafferty et al., 2001)
are undirected graphical models used to calculate the con-
ditional probability of values on designated output nodes
given values assigned to other designated input nodes.
In the special case in which the output nodes of the
graphical model are linked by edges in a linear chain,
CRFs make a first-order Markov independence assump-
tion, and thus can be understood as conditionally-trained
finite state machines (FSMs). In the remainder of this
section we introduce the likelihood model, inference and
estimation procedures for CRFs.
Let o = 〈o1,o2,...oT〉 be some observed input data
sequence, such as a sequence of words in text in a doc-
ument, (the values on n input nodes of the graphical
model). Let S be a set of FSM states, each of which
is associated with a label, l ∈ L, (such as ORG). Let
s = 〈s1,s2,...sT〉 be some sequence of states, (the val-
ues on T output nodes). By the Hammersley-Clifford the-
orem, CRFs define the conditional probability of a state
sequence given an input sequence to be
PΛ(s|o) = 1Z
o
exp
parenleftBigg Tsummationdisplay
t=1
summationdisplay
k
λkfk(st−1,st,o,t)
parenrightBigg
,
where Zo is a normalization factor over all state se-
quences, fk(st−1,st,o,t) is an arbitrary feature func-
tion over its arguments, and λk is a learned weight for
each feature function. A feature function may, for exam-
ple, be defined to have value 0 in most cases, and have
value 1 if and only if st−1 is state #1 (which may have
label OTHER), and st is state #2 (which may have la-
bel LOCATION), and the observation at position t in o
is a word appearing in a list of country names. Higher λ
weights make their corresponding FSM transitions more
likely, so the weight λk in this example should be pos-
itive. More generally, feature functions can ask pow-
erfully arbitrary questions about the input sequence, in-
cluding queries about previous words, next words, and
conjunctions of all these, and fk(·) can range −∞...∞.
CRFs define the conditional probability of a label
sequence based on total probability over the state se-
quences, PΛ(l|o) = summationtexts:l(s)=l PΛ(s|o), where l(s) is
the sequence of labels corresponding to the labels of the
states in sequence s.
Note that the normalization factor, Zo, is the sum
of the “scores” of all possible state sequences, Zo =summationtext
s∈ST exp
parenleftBigsummationtextT
t=1
summationtext
k λkfk(st−1,st,o,t)
parenrightBig
, and that
the number of state sequences is exponential in the in-
put sequence length, T. In arbitrarily-structured CRFs,
calculating the normalization factor in closed form is
intractable, but in linear-chain-structured CRFs, as in
forward-backward for hidden Markov models (HMMs),
the probability that a particular transition was taken be-
tween two CRF states at a particular position in the input
sequence can be calculated efficiently by dynamic pro-
gramming. We define slightly modified forward values,
αt(si), to be the “unnormalized probability” of arriving
in state si given the observations 〈o1,...ot〉. We set α0(s)
equal to the probability of starting in each state s, and
recurse:
αt+1(s) =
summationdisplay
sprime
αt(sprime)exp
parenleftBiggsummationdisplay
k
λkfk(sprime,s,o,t)
parenrightBigg
.
The backward procedure and the remaining details of
Baum-Welch are defined similarly. Zo is thensummationtexts αT(s).
The Viterbi algorithm for finding the most likely state
sequence given the observation sequence can be corre-
spondingly modified from its HMM form.
2.1 Training CRFs
The weights of a CRF, Λ={λ,...}, are set to maximize the
conditional log-likelihood of labeled sequences in some
training set, D = {〈o,l〉(1),...〈o,l〉(j),...〈o,l〉(N)}:
LΛ =
Nsummationdisplay
j=1
log
parenleftBig
PΛ(l(j)|o(j))
parenrightBig
−
summationdisplay
k
λ2k
2σ2,
where the second sum is a Gaussian prior over parameters
(with variance σ) that provides smoothing to help cope
with sparsity in the training data.
When the training labels make the state sequence un-
ambiguous (as they often do in practice), the likelihood
function in exponential models such as CRFs is con-
vex, so there are no local maxima, and thus finding the
global optimum is guaranteed. It has recently been shown
that quasi-Newton methods, such as L-BFGS, are signifi-
cantly more efficient than traditional iterative scaling and
even conjugate gradient (Malouf, 2002; Sha and Pereira,
2003). This method approximates the second-derivative
of the likelihood by keeping a running, finite-sized win-
dow of previous first-derivatives.
L-BFGS can simply be treated as a black-box opti-
mization procedure, requiring only that one provide the
first-derivative of the function to be optimized. Assum-
ing that the training labels on instance j make its state
path unambiguous, let s(j) denote that path, and then the
first-derivative of the log-likelihood is
δL
δλk =


Nsummationdisplay
j=1
Ck(s(j),o(j))

−


Nsummationdisplay
j=1
summationdisplay
s
PΛ(s|o(j))Ck(s,o(j))

− λkσ2
where Ck(s,o) is the “count” for feature k given s
and o, equal to summationtextTt=1 fk(st−1,st,o,t), the sum of
fk(st−1,st,o,t) values for all positions, t, in the se-
quence s. The first two terms correspond to the differ-
ence between the empirical expected value of feature fk
and the model’s expected value: ( ˜E[fk]−EΛ[fk])N. The
last term is the derivative of the Gaussian prior.
3 Efficient Feature Induction for CRFs
Typically the features, fk, are based on some number of
hand-crafted atomic observational tests (such as word is
capitalized or word is “said”, or word appears in lexi-
con of country names), and a large collection of features
is formed by making conjunctions of the atomic tests in
certain user-defined patterns; (for example, the conjunc-
tions consisting of all tests at the current sequence po-
sition conjoined with all tests at the position one step
ahead—specifically, for instance, current word is capi-
talized and next word is “Inc”). There can easily be
over 100,000 atomic tests (mostly based on tests for the
identity of words in the vocabulary), and ten or more
shifted-conjunction patterns—resulting in several million
features (Sha and Pereira, 2003). This large number of
features can be prohibitively expensive in memory and
computation; furthermore many of these features are ir-
relevant, and others that are relevant are excluded.
In response, we wish to use just those time-shifted
conjunctions that will significantly improve performance.
We start with no features, and over several rounds of fea-
ture induction: (1) consider a set of proposed new fea-
tures, (2) select for inclusion those candidate features that
will most increase the log-likelihood of the correct state
path s(j), and (3) train weights for all features. The pro-
posed new features are based on the hand-crafted obser-
vational tests—consisting of singleton tests, and binary
conjunctions of tests with each other and with features
currently in the model. The later allows arbitrary-length
conjunctions to be built. The fact that not all singleton
tests are included in the model gives the designer great
freedom to use a very large variety of observational tests,
and a large window of time shifts.
To consider the effect of adding a new feature, define
the new sequence model with additional feature, g, hav-
ing weight µ, to be
PΛ+g,µ(s|o) =
PΛ(s|o)exp
parenleftBigsummationtextT
t=1 µg(st−1,st,o,t)
parenrightBig
Zo(Λ,g,µ) ;
Zo(Λ,g,µ) def=summationtextsprime PΛ(sprime|o)exp(summationtextTt=1 µg(sprimet−1,sprimet,o,t))
in the denominator is simply the additional portion of
normalization required to make the new function sum to
1 over all state sequences.
Following (Della Pietra et al., 1997), we efficiently as-
sess many candidate features in parallel by assuming that
the λ parameters on all included features remain fixed
while estimating the gain, G(g), of a candidate feature, g,
based on the improvement in log-likelihood it provides,
GΛ(g) = maxµ GΛ(g,µ) = maxµ LΛ+gµ −LΛ.
where LΛ+gµ includes −µ2/2σ2.
In addition, we make this approach tractable for CRFs
with two further reasonable and mutually-supporting ap-
proximations specific to CRFs. (1) We avoid dynamic
programming for inference in the gain calculation with
a mean-field approximation, removing the dependence
among states. (Thus we transform the gain from a se-
quence problem to a token classification problem. How-
ever, the original posterior distribution over states given
each token, PΛ(s|o) = αt(s|o)βt+1(s|o)/Zo, is still
calculated by dynamic programming without approxima-
tion.) Furthermore, we can calculate the gain of aggre-
gate features irrespective of transition source, g(st,o,t),
and expand them after they are selected. (2) In many
sequence problems, the great majority of the tokens are
correctly labeled even in the early stages of training. We
significantly gain efficiency by including in the gain cal-
culation only those tokens that are mislabeled by the cur-
rent model. Let {o(i) : i = 1...M} be those tokens, and
o(i) be the input sequence in which the ith error token
occurs at position t(i). Then algebraic simplification us-
ing these approximations and previous definitions gives
GΛ(g,µ) =
Msummationdisplay
i=1
log
parenleftBigg
expparenleftbigµg(st(i),o(i),t(i))parenrightbig
Zo(i)(Λ,g,µ)
parenrightBigg
− µ
2
2σ2
= Mµ ˜E[g] −
Msummationdisplay
i=1
log(EΛ[exp(µg)|o(i)] − µ
2
2σ2,
where Zo(i)(Λ,g,µ) (with non-bold o) is simplysummationtext
s PΛ(s|o(i))exp(µg(s,o(i),t(i))). The optimal val-ues of the µ’s cannot be solved in closed form, but New-
ton’s method finds them all in about 12 quick iterations.
There are two additional important modeling choices:
(1) Because we expect our models to still require sev-
eral thousands of features, we save time by adding many
of the features with highest gain each round of induction
rather than just one; (including a few redundant features
is not harmful). (2) Because even models with a small se-
lect number of features can still severely overfit, we train
the model with just a few BFGS iterations (not to con-
vergence) before performing the next round of feature in-
duction. Details are in (McCallum, 2003).
4 Web-augmented Lexicons
Some general-purpose lexicons, such a surnames and lo-
cation names, are widely available, however, many nat-
ural language tasks will benefit from more task-specific
lexicons, such as lists of soccer teams, political parties,
NGOs and English counties. Creating new lexicons en-
tirely by hand is tedious and time consuming.
Using a technique we call WebListing, we build lexi-
cons automatically from HTML data on the Web. Previ-
ous work has built lexicons from fixed corpora by deter-
mining linguistic patterns for the context in which rele-
vant words appear (Collins and Singer, 1999; Jones et al.,
1999). Rather than mining a small corpus, we gather data
from nearly the entire Web; rather than relying on fragile
linguistic context patterns, we leverage robust formatting
regularities on the Web. WebListing finds co-occurrences
of seed terms that appear in an identical HTML format-
ting pattern, and augments a lexicon with other terms on
the page that share the same formatting. Our current im-
plementation uses GoogleSets, which we understand to
be a simple implementation of this approach based on us-
ing HTML list items as the formatting regularity. We are
currently building a more sophisticated replacement.
5 Results
To perform named entity extraction on the news articles
in the CoNLL-2003 English shared task, several families
of features are used, all time-shifted by -2, -1, 0, 1, 2: (a)
the word itself, (b) 16 character-level regular expressions,
mostly concerning capitalization and digit patterns, such
as A, A+, Aa+, Aa+Aa*, A., D+, where A, a and D indi-
cate the regular expressions [A-Z], [a-z] and [0-9],
(c) 8 lexicons entered by hand, such as honorifics, days
and months, (d) 15 lexicons obtained from specific web
sites, such as countries, publicly-traded companies, sur-
names, stopwords, and universities, (e) 25 lexicons ob-
tained by WebListing (including people names, organi-
zations, NGOs and nationalities), (f) all the above tests
with prefix firstmention from any previous duplicate of
the current word, (if capitalized). A small amount of
hand-filtering was performed on some of the WebList-
ing lexicons. Since GoogleSets’ support for non-English
is severely limited, only 5 small lexicons were used for
German; but character bi- and tri-grams were added.
A Java-implemented, first-order CRF was trained for
about 12 hours on a 1GHz Pentium with a Gaussian prior
variance of 0.5, inducing 1000 or fewer features (down
to a gain threshold of 5.0) each round of 10 iterations of
L-BFGS. Candidate conjunctions are limited to the 1000
atomic and existing features with highest gain. Perfor-
mance results for each of the entity classes can be found
in Figure 1. The model achieved an overall F1 of 84.04%
on the English test set using 6423 features. (Using a set
of fixed conjunction patterns instead of feature induction
results in F1 73.34%, with about 1 million features; trial-
and-error tuning the fixed patterns would likely improve
this.) Accuracy gains are expected from experimentation
with the induction parameters and improved WebListing.
Acknowledgments
We thank John Lafferty, Fernando Pereira, Andres Corrada-
Emmanuel, Drew Bagnell and Guy Lebanon, for helpful
input. This work was supported in part by the Center
for Intelligent Information Retrieval, SPAWARSYSCEN-SD
grant numbers N66001-99-1-8912 and N66001-02-1-8903, Ad-
vanced Research and Development Activity under contract
number MDA904-01-C-0984, and DARPA contract F30602-
01-2-0566.

References
A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. 1998.
Exploiting diverse knowledge sources via maximum entropy
in named entity recognition. In Proceedings of the Sixth
Workshop on Very Large Corpora, Association for Compu-
tational Linguistics.
M. Collins and Y. Singer. 1999. Unsupervised models for
named entity classification. In Proceedings of the Joint SIG-
DAT Conference on Empirical Methods in Natural Language
Processing and Very Large Corpora.
Stephen Della Pietra, Vincent J. Della Pietra, and John D. Laf-
ferty. 1997. Inducing Features of Random Fields. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
19(4):380–393.
Rosie Jones, Andrew McCallum, Kamal Nigam, and Ellen
Riloff. 1999. Bootstrapping for Text Learning Tasks. In
IJCAI-99 Workshop on Text Mining: Foundations, Tech-
niques and Applications.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001.
Conditional Random Fields: Probabilistic Models for Seg-
menting and Labeling Sequence Data. In Proc. ICML.
Robert Malouf. 2002. A comparison of algorithms for max-
imum entropy parameter estimation. In Sixth Workshop on
Computational Language Learning (CoNLL-2002).
Andrew McCallum and Fang-Fang Feng. 2003. Chinese
Word Segmentation with Conditional Random Fields and In-
tegrated Domain Knowledge. In Unpublished Manuscript.
Andrew McCallum. 2003. Efficiently Inducing Features of
Conditional Random Fields. In Nineteenth Conference on
Uncertainty in Artificial Intelligence (UAI03). (Submitted).
Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for
Part-of-Speech Tagging. In Eric Brill and Kenneth Church,
editors, Proceedings of the Conference on Empirical Meth-
ods in Natural Language Processing, pages 133–142. Asso-
ciation for Computational Linguistics.
Fei Sha and Fernando Pereira. 2003. Shallow Parsing with
Conditional Random Fields. In Proceedings of Human Lan-
guage Technology, NAACL.
