Maximum Entropy Models for Named Entity Recognition
Oliver Bender and Hermann Ney
Lehrstuhl für Informatik VI
Computer Science Department
RWTH Aachen - University of Technology
D-52056 Aachen, Germany
{bender,ney}@cs.rwth-aachen.de

Franz Josef Och
Information Sciences Institute
University of Southern California
Marina del Rey, CA 90292
och@isi.edu
Abstract
In this paper, we describe a system that applies
maximum entropy (ME) models to the task of
named entity recognition (NER). Starting with
an annotated corpus and a set of features which
are easily obtainable for almost any language,
we first build a baseline NE recognizer which
is then used to extract the named entities and
their context information from additional non-
annotated data. In turn, these lists are incor-
porated into the final recognizer to further im-
prove the recognition accuracy.
1 Introduction
In this paper, we present an approach for extracting the
named entities (NE) of natural language inputs which
uses the maximum entropy (ME) framework (Berger et
al., 1996). The objective can be described as follows.
Given a natural input sequence $w_1^N = w_1 \ldots w_n \ldots w_N$, we
choose the NE tag sequence $e_1^N = e_1 \ldots e_n \ldots e_N$ with the
highest probability among all possible tag sequences:

$$\hat{e}_1^N = \operatorname*{argmax}_{e_1^N} \left\{ \Pr(e_1^N \mid w_1^N) \right\}.$$
The argmax operation denotes the search problem, i.e.
the generation of the sequence of named entities. In accordance
with the CoNLL-2003 shared task, we concentrate
on four types of named entities: persons (PER), locations
(LOC), organizations (ORG), and names of miscellaneous
entities (MISC) that do not belong to the previous three
groups, e.g.
[PER Clinton] ’s [ORG Ballybunion] fans in-
vited to [LOC Chicago] .
Additionally, the task requires the processing of two
different languages, of which only English was specified
before the submission deadline. Therefore, the
system described here avoids relying on language-dependent
knowledge and instead uses a set of features which are
easily obtainable for almost any language.
The remainder of the paper is organized as follows: in
section 2, we outline the ME framework and specify the
features that were used for the experiments. We describe
the training and search procedure of our approach. Sec-
tion 3 presents experimental details and shows results ob-
tained on the English and German test sets. Finally, sec-
tion 4 closes with a summary and an outlook for future
work.
2 Maximum Entropy Models
For our approach, we directly factorize the posterior
probability and determine the corresponding NE tag
for each word of an input sequence. We assume that
the decisions depend only on a limited window
$w_{n-2}^{n+2} = w_{n-2} \ldots w_{n+2}$ around the current word $w_n$ and
on the two predecessor tags. Thus, we obtain the following
second-order model:

$$\Pr(e_1^N \mid w_1^N) = \prod_{n=1}^{N} \Pr(e_n \mid e_1^{n-1}, w_1^N)
= \prod_{n=1}^{N} p(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2}).$$
A well-founded framework for directly modeling the
posterior probability $p(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2})$ is maximum
entropy (Berger et al., 1996). In this framework, we have
a set of $M$ feature functions $f_m(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2})$,
$m = 1, \ldots, M$. For each feature function $f_m$, there exists a
model parameter $\lambda_m$. The posterior probability can then
be modeled as follows:
[Figure: the input sequence passes through a preprocessing step, a global
search that maximizes $\prod_{n=1}^{N} p(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2})$
over all tag sequences, and a postprocessing step that yields the tag sequence.]

Figure 1: Architecture of the maximum entropy model
approach.
$$p_{\lambda_1^M}(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2}) =
\frac{\exp\left[\sum_{m=1}^{M} \lambda_m f_m(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2})\right]}
{\sum_{e'} \exp\left[\sum_{m=1}^{M} \lambda_m f_m(e_{n-2}^{n-1}, e', w_{n-2}^{n+2})\right]}. \quad (1)$$
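As an illustration, the posterior of Eq. 1 can be sketched as a softmax over weighted sums of active binary features. This is a minimal sketch, not the authors' implementation; the two example features, their weights, and the example window are invented for illustration:

```python
import math

def me_posterior(lambdas, features, history, window, tags):
    """p(e_n | e_{n-2}^{n-1}, w_{n-2}^{n+2}) as in Eq. 1: a softmax over
    the weighted sums of the active binary feature functions."""
    def score(tag):
        return sum(lam * f(history, tag, window)
                   for lam, f in zip(lambdas, features))
    exp_scores = {tag: math.exp(score(tag)) for tag in tags}
    z = sum(exp_scores.values())  # renormalization over all candidate tags e'
    return {tag: s / z for tag, s in exp_scores.items()}

# Two invented features: a lexical feature for the current word and a
# prior feature for the filler tag O.
features = [
    lambda hist, tag, win: 1.0 if win[2] == "Chicago" and tag == "LOC" else 0.0,
    lambda hist, tag, win: 1.0 if tag == "O" else 0.0,
]
lambdas = [2.0, 0.5]
posterior = me_posterior(lambdas, features, ("O", "O"),
                         ("invited", "to", "Chicago", ".", "</s>"),
                         ["PER", "LOC", "ORG", "MISC", "O"])
```

The exponentials guarantee positive scores, and the denominator makes the five tag probabilities sum to one.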
The architecture of the ME approach is summarized in
Figure 1.
The CoNLL-2003 shared task data sets often
provide additional information such as part-of-speech (POS)
tags. In order to take advantage of these knowledge
sources, our system is able to process several input se-
quences at the same time.
2.1 Feature Functions
We have implemented a set of binary valued feature func-
tions for our system:
Lexical features: The words $w_{n-2}, \ldots, w_{n+2}$ are compared to a
vocabulary. Words which are seen less than twice in the
training data are mapped onto an 'unknown word'. Formally, the feature

$$f_{w',i,e}(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2}) = \delta(w_{n+i}, w') \cdot \delta(e_n, e),
\quad i \in \{-2, \ldots, 2\},$$

will fire if the word $w_{n+i}$ matches the vocabulary entry $w'$
and if the prediction for the current NE tag equals $e$.
$\delta(\cdot, \cdot)$ denotes the Kronecker function.
Word features: Word characteristics are covered by
the word features, which test for:
- Capitalization: These features will fire if a4 a15 is cap-
italized, has an internal capital letter, or is fully cap-
italized.
- Digits and numbers: ASCII digit strings and number
expressions activate these features.
- Pre- and suffixes: If the prefix (suffix) of a4a56a15 equals
a given prefix (suffix), these features will fire.
Transition features: Transition features model the dependence
on the two predecessor tags:

$$f_{i,e',e}(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2}) = \delta(e_{n-i}, e') \cdot \delta(e_n, e),
\quad i \in \{1, 2\}.$$
Prior features: The single named entity priors are incorporated
by prior features. They just fire for the currently
observed NE tag:

$$f_e(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2}) = \delta(e_n, e).$$
Compound features: Using the feature functions defined
so far, we can only specify features that refer to
a single word or tag. To also enable word phrases and
word/tag combinations, we introduce the following compound
features:

$$f_{(v_1,i_1),\ldots,(v_J,i_J),e}(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2}) =
\prod_{j=1}^{J} f_{v_j,i_j,e}(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2}),$$

where $v_j \in \{w', e'\}$ and $i_j \in \{-2, \ldots, 2\}$.
Dictionary features: Given a list $L$ of named entities,
the dictionary features check whether or not an entry of
$L$ occurs within the current window. Formally,

$$f_{L,e}(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2}) =
\mathrm{entryOccurs}(L, w_{n-2}^{n+2}) \cdot \delta(e_n, e).$$

Accordingly, the dictionary features fire if an entry of
a context list appears beside or around the current word
position $w_n$.
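The binary feature functions above can be sketched as simple closures over a Kronecker delta. This is a minimal illustrative sketch; the encoding of histories and windows is an assumption, and `entryOccurs` is simplified to single-word dictionary entries:

```python
def delta(a, b):
    """Kronecker function: 1 if the arguments are equal, else 0."""
    return 1.0 if a == b else 0.0

# Each factory returns a binary feature f(e_{n-2}^{n-1}, e_n, w_{n-2}^{n+2});
# the window is a 5-tuple with the current word w_n at index 2.
def lexical_feature(word, i, tag):            # i in {-2, ..., 2}
    return lambda hist, e_n, win: delta(win[i + 2], word) * delta(e_n, tag)

def transition_feature(i, prev_tag, tag):     # i in {1, 2}
    return lambda hist, e_n, win: delta(hist[-i], prev_tag) * delta(e_n, tag)

def prior_feature(tag):
    return lambda hist, e_n, win: delta(e_n, tag)

def dictionary_feature(entries, tag):
    # simplified entryOccurs: single-word entries anywhere in the window
    return lambda hist, e_n, win: (
        (1.0 if any(w in entries for w in win) else 0.0) * delta(e_n, tag))
```

Compound features would then simply multiply several such functions, which stays binary because every factor is binary.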
2.2 Feature Selection
Feature selection plays a crucial role in the ME frame-
work. In our system, we use simple count-based feature
reduction. Given a threshold $c$, we only include those
features that have been observed on the training data at
least $c$ times. Although this method does not guarantee
a minimal set of features, it turned out to perform well in practice.

Experiments were carried out with different thresholds.
It turned out that for the NER task, a threshold of 1 for the
English data and 2 for the German corpus achieved the
best results for all features, except for the prefix and suffix
features, for which a higher threshold yielded the best
results.
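The count-based feature reduction can be sketched as follows; the string encoding of feature events is a hypothetical illustration:

```python
from collections import Counter

def select_features(feature_events, threshold):
    """Count-based feature reduction: keep only those features that were
    observed at least `threshold` times on the training data."""
    counts = Counter(feature_events)
    return {f for f, c in counts.items() if c >= threshold}

# Hypothetical feature events collected from an annotated corpus.
events = ["lex:Clinton/PER", "lex:Clinton/PER", "lex:to/O",
          "trans:PER->O", "trans:PER->O", "trans:PER->O"]
kept = select_features(events, threshold=2)
```

Singleton features such as `lex:to/O` are discarded, which shrinks the model without guaranteeing minimality.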
2.3 Training
For training purposes, we consider the set of manually an-
notated and segmented training sentences to form a single
long sentence. As training criterion, we use the maximum
class posterior probability criterion:
$$\hat{\lambda}_1^M = \operatorname*{argmax}_{\lambda_1^M}
\left\{ \sum_{n=1}^{N} \log p_{\lambda_1^M}(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2}) \right\}.$$
This corresponds to maximizing the likelihood of the ME
model. Since the optimization criterion is convex, there is
only a single optimum and no convergence problems occur.
To train the model parameters $\lambda_1^M$, we use the Generalized
Iterative Scaling (GIS) algorithm (Darroch and
Ratcliff, 1972).
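A toy version of GIS training can be sketched as follows. This is a minimal sketch on an abstract feature representation, not the authors' implementation; a slack feature pads every event so that the total feature count equals the constant C that GIS requires:

```python
import math

def gis(data, feats, num_feats, labels, iterations=100):
    """Generalized Iterative Scaling (Darroch and Ratcliff, 1972) for a
    conditional ME model. feats(x, y) returns the ids of the active
    binary features of event (x, y)."""
    C = 1 + max(len(feats(x, y)) for x, _ in data for y in labels)
    slack = num_feats                        # id of the extra slack feature
    lam = [0.0] * (num_feats + 1)

    def active(x, y):
        a = list(feats(x, y))
        return a + [slack] * (C - len(a))    # pad to exactly C features

    def posterior(x):
        scores = [math.exp(sum(lam[m] for m in active(x, y))) for y in labels]
        z = sum(scores)
        return [s / z for s in scores]

    # empirical feature counts on the training data
    empirical = [0.0] * (num_feats + 1)
    for x, y in data:
        for m in active(x, y):
            empirical[m] += 1.0

    for _ in range(iterations):
        expected = [1e-12] * (num_feats + 1)  # model feature expectations
        for x, _ in data:
            for y, p_y in zip(labels, posterior(x)):
                for m in active(x, y):
                    expected[m] += p_y
        for m in range(num_feats + 1):
            if empirical[m] > 0.0:
                lam[m] += math.log(empirical[m] / expected[m]) / C
    return lam, posterior

# Toy problem: two contexts, two labels; one indicator feature per
# (context, label) pair, so feature 2*x+y fires only for that event.
data = [(0, 0)] * 3 + [(0, 1)] + [(1, 1)] * 4
lam, posterior = gis(data, lambda x, y: [2 * x + y], 4, [0, 1])
```

On this toy data the learned posteriors approach the empirical class frequencies; note that the feature seen only for one context/label pair is pushed towards an ever larger weight, which is exactly the overfitting effect the Gaussian prior smoothing addresses.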
In practice, the training procedure tends to result in an
overfitted model. To avoid overfitting, Chen and Rosenfeld
(1999) suggested a smoothing method in which a
Gaussian prior on the parameters is assumed. Instead of
maximizing the probability of the training data alone, we now
maximize the probability of the training data times the
prior probability of the model parameters:
$$\hat{\lambda}_1^M = \operatorname*{argmax}_{\lambda_1^M}
\left\{ p(\lambda_1^M) \cdot \prod_{n=1}^{N}
p_{\lambda_1^M}(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2}) \right\},$$

where

$$p(\lambda_1^M) = \prod_{m=1}^{M} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left(-\frac{\lambda_m^2}{2\sigma^2}\right).$$
This method avoids very large lambda values and
prevents features that occur only once for a specific
class from receiving an infinite weight. Note that there is only one
parameter $\sigma$ for all model parameters $\lambda_1^M$.
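The effect of the prior on the training objective can be sketched as a penalized log-likelihood (a minimal sketch; the Gaussian normalization constant does not depend on the lambdas and could be dropped during optimization):

```python
import math

def penalized_log_likelihood(log_likelihood, lambdas, sigma):
    """Log of (training likelihood times Gaussian prior): every lambda_m
    contributes a penalty -lambda_m^2 / (2 sigma^2) plus a constant that
    does not depend on the lambdas."""
    penalty = sum(l * l for l in lambdas) / (2.0 * sigma * sigma)
    constant = 0.5 * len(lambdas) * math.log(2.0 * math.pi * sigma * sigma)
    return log_likelihood - penalty - constant

# Equal data likelihood, but very large weights (as produced by a feature
# occurring only once for one class) are penalized heavily.
small_weights = penalized_log_likelihood(-100.0, [0.5] * 4, sigma=2.0)
large_weights = penalized_log_likelihood(-100.0, [50.0] * 4, sigma=2.0)
```

Because the quadratic penalty grows without bound, no finite data likelihood gain can justify a weight drifting towards infinity.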
2.4 Search
In the test phase, the search is performed using the so-called
maximum approximation, i.e. the most likely sequence
of named entities $\hat{e}_1^N$ is chosen among all possible
sequences $e_1^N$:

$$\hat{e}_1^N = \operatorname*{argmax}_{e_1^N}
\left\{ \Pr(e_1^N \mid w_1^N) \right\}
= \operatorname*{argmax}_{e_1^N}
\left\{ \prod_{n=1}^{N} p_{\lambda_1^M}(e_n \mid e_{n-2}^{n-1}, w_{n-2}^{n+2}) \right\}.$$
Therefore, the time-consuming renormalization in Eq. 1
is not needed during search. We run a Viterbi search to
find the highest probability sequence (Borthwick et al.,
1998).
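The second-order Viterbi search can be sketched with search states formed by the two predecessor tags. This is a minimal sketch; `toy_prob` is a hypothetical stand-in for the trained ME posterior $p_{\lambda_1^M}$:

```python
import math

def viterbi(words, tags, local_prob):
    """Search for the most likely tag sequence under a second-order model
    p(e_n | e_{n-2}^{n-1}, w_{n-2}^{n+2}); a search state is the pair of
    the two predecessor tags, and '<s>'/'</s>' pad the sentence."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>", "</s>"]
    states = {("<s>", "<s>"): (0.0, [])}  # state -> (log prob, best sequence)
    for n in range(len(words)):
        window = tuple(padded[n:n + 5])   # w_{n-2}^{n+2}, current word at index 2
        new_states = {}
        for (e2, e1), (lp, seq) in states.items():
            for e in tags:
                score = lp + math.log(local_prob((e2, e1), e, window))
                if (e1, e) not in new_states or score > new_states[(e1, e)][0]:
                    new_states[(e1, e)] = (score, seq + [e])
        states = new_states
    return max(states.values())[1]

# Hypothetical toy model (not the trained ME model): capitalized words
# prefer LOC, all others the filler tag O.
def toy_prob(history, tag, window):
    target = "LOC" if window[2][:1].isupper() else "O"
    return 0.9 if tag == target else 0.1

tagging = viterbi(["invited", "to", "Chicago"], ["LOC", "O"], toy_prob)
```

Since only the argmax is needed, the unnormalized scores of Eq. 1 could be used here directly, which is why the renormalization can be skipped during search.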
3 Experiments
Experiments were performed on English and German test
sets. The English data was derived from the Reuters corpus[1],
while the German test sets were extracted from the
ECI Multilingual Text corpus. The data sets contain tokens
(words and punctuation marks), information about
the sentence boundaries, as well as the assigned NE tags.
Additionally, a POS tag and a syntactic chunk tag were
assigned to each token. On the tag level, we distinguish
five tags (the four NE tags mentioned above and a filler
tag).
3.1 Incorporating Lists of Names and
Non-annotated Data
For the English task, extra lists of names were provided,
and for both languages, additional non-annotated data
was supplied. Hence, the challenge was to find ways of
incorporating this information. Our system aims at this
challenge via the use of dictionary features.
While the provided lists could be integrated straightforwardly,
the raw data was processed in three stages:
1. Given the annotated training data, we used all fea-
tures except the dictionary ones to build a first base-
line NE recognizer.
2. Applying this recognizer, the non-annotated data
was processed and all named entities plus contexts
(up to three words beside the classified NE and the
two surrounding words) were extracted and stored
as additional lists.
3. These lists could again be integrated straightforwardly.
It turned out that a threshold of five yielded
the best results both for the lists of named entities
and for the context information.
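Step 2 above can be sketched as follows; the exact window sizes and list format used by the authors are assumptions:

```python
def extract_lists(tagged_sentences, width=3):
    """Collect named entities and surrounding context words from
    automatically tagged data, to be used as additional dictionary lists."""
    entities, contexts = set(), set()
    for words, tags in tagged_sentences:
        n = 0
        while n < len(words):
            if tags[n] == "O":
                n += 1
                continue
            start, tag = n, tags[n]
            while n < len(words) and tags[n] == tag:
                n += 1
            entities.add((tag, tuple(words[start:n])))
            contexts.update(words[max(0, start - width):start])  # left context
            contexts.update(words[n:n + width])                  # right context
    return entities, contexts

sentence = (["Clinton", "'s", "Ballybunion", "fans", "invited", "to",
             "Chicago", "."],
            ["PER", "O", "ORG", "O", "O", "O", "LOC", "O"])
entities, contexts = extract_lists([sentence])
```

The extracted sets then feed the dictionary features of the final recognizer, just as the provided name lists do.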
3.2 Results
Table 1 and Table 2 present the results obtained on the
development and test sets. For both languages, 1 000 GIS
iterations were performed and the Gaussian prior method
was applied.
Test Set        Precision  Recall   F_{beta=1}
English devel.  90.01%     88.52%   89.26
English test    84.45%     82.90%   83.67
German devel.   73.60%     57.73%   64.70
German test     76.12%     60.74%   67.57

Table 1: Overall performance of the baseline system on
the development and test sets in English and German.
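As a quick check of Table 1, $F_{\beta=1}$ is the harmonic mean of precision and recall:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta score: the weighted harmonic mean of precision and recall.
    For beta = 1 this reduces to 2 * P * R / (P + R)."""
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

f_en_devel = f_measure(90.01, 88.52)  # matches the 89.26 reported in Table 1
```

Being a harmonic mean, the score is dominated by the weaker of the two components, which explains the low German figures despite decent precision.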
[1] The Reuters corpus was kindly provided by Reuters Limited.
[Figure: F-measure [%] of the baseline system plotted over the standard
deviation $\sigma$, for the smoothed model vs. no smoothing.]

Figure 2: Results of the baseline system for different
smoothing parameters.
As can be seen from Table 1, our baseline recognizer
clearly outperforms the CoNLL-2003 baseline (e.g.
$F_{\beta=1} = 89.26$ vs. $F_{\beta=1} = 71.18$ on the English
development set). To investigate the
contribution of the Gaussian prior method, several experiments
were carried out for different standard deviation
parameters $\sigma$. Figure 2 depicts the obtained F-measures
in comparison to the performance of non-smoothed ME
models. The gain in performance is obvious.
By incorporating the information extracted from the
non-annotated data, our system is further improved. On
the German data, however, the results show a performance
degradation. The main reason for this is the capitalization
of German nouns; refined lists of proper
names are therefore necessary.
4 Summary
In conclusion, we have presented a system for the task of
named entity recognition that uses the maximum entropy
framework. We have shown that a baseline system based
on an annotated training set can be improved by incorpo-
rating additional non-annotated data.
For future investigations, we have to think about a
more sophisticated treatment of the additional informa-
tion. One promising possibility could be to extend our
system as follows: apply the baseline recognizer to an-
notate the raw data as before, but then use the output to
train a new recognizer. The scores of the new system are
incorporated as further features and the procedure is iter-
ated until convergence.

References
A. L. Berger, S. A. Della Pietra, and V. J. Della
Pietra. 1996. A maximum entropy approach to nat-
ural language processing. Computational Linguistics,
22(1):39–72, March.
A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman.
1998. NYU: Description of the MENE
named entity system as used in MUC-7. In Proceedings
of the Seventh Message Understanding
Conference (MUC-7), 6 pages, Fairfax, VA, April.
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/.
S. Chen and R. Rosenfeld. 1999. A Gaussian prior
for smoothing maximum entropy models. Technical
Report CMU-CS-99-108, Carnegie Mellon University,
Pittsburgh, PA.
J. N. Darroch and D. Ratcliff. 1972. Generalized iter-
ative scaling for log-linear models. Annals of Mathe-
matical Statistics, 43:1470–1480.
