Named Entity Recognition: A Maximum Entropy Approach
Using Global Information
Hai Leong Chieu
DSO National Laboratories
20 Science Park Drive
Singapore 118230
chaileon@dso.org.sg
Hwee Tou Ng
Department of Computer Science
School of Computing
National University of Singapore
3 Science Drive 2
Singapore 117543
nght@comp.nus.edu.sg
Abstract
This paper presents a maximum entropy-based
named entity recognizer (NER). It differs from pre-
vious machine learning-based NERs in that it uses
information from the whole document to classify
each word, with just one classifier. Previous work
that involves the gathering of information from the
whole document often uses a secondary classifier,
which corrects the mistakes of a primary sentence-
based classifier. In this paper, we show that the
maximum entropy framework is able to make use
of global information directly, and achieves perfor-
mance that is comparable to the best previous ma-
chine learning-based NERs on MUC-6 and MUC-7
test data.
1 Introduction
A considerable amount of work has been done in
recent years on the named entity recognition task,
partly due to the Message Understanding Confer-
ences (MUC). A named entity recognizer (NER) is
useful in many NLP applications such as informa-
tion extraction, question answering, etc. On its own,
a NER can also provide users who are looking for
person or organization names with quick informa-
tion. In MUC-6 and MUC-7, the named entity task
is defined as finding the following classes of names:
person, organization, location, date, time, money,
and percent (Chinchor, 1998; Sundheim, 1995).
Machine learning systems in MUC-6 and MUC-
7 achieved accuracy comparable to rule-based sys-
tems on the named entity task.
Statistical NERs usually find the sequence of tags
that maximizes the probability P(C | s), where s
is the sequence of words in a sentence, and C
is the sequence of named-entity tags assigned to
the words in s. Attempts have been made to use
global information (e.g., the same named entity oc-
curring in different sentences of the same docu-
ment), but they usually consist of incorporating an
additional classifier, which tries to correct the er-
rors in the output of a first NER (Mikheev et al.,
1998; Borthwick, 1999). We propose maximizing
P(C | s, D), where C is the sequence of named-entity
tags assigned to the words in the sentence s, and D
is the information that can be extracted from the
whole document containing s. Our system is built
on a maximum entropy classifier. By
making use of global context, it has achieved ex-
cellent results on both MUC-6 and MUC-7 official
test data. We will refer to our system as MENERGI
(Maximum Entropy Named Entity Recognizer us-
ing Global Information).
As far as we know, no other NERs have used in-
formation from the whole document (global) as well
as information within the same sentence (local) in
one framework. The use of global features has im-
proved the performance on MUC-6 test data from
90.75% to 93.27% (27% reduction in errors), and
the performance on MUC-7 test data from 85.22%
to 87.24% (14% reduction in errors). These results
are achieved by training on the official MUC-6 and
MUC-7 training data, which is much less training
data than is used by other machine learning systems
that worked on the MUC-6 or MUC-7 named entity
task (Bikel et al., 1997; Bikel et al., 1999; Borth-
wick, 1999).
We believe it is natural for authors to use abbre-
viations in subsequent mentions of a named entity
(e.g., first “President George Bush” then “Bush”).
As such, global information from the whole context
of a document is important to more accurately rec-
ognize named entities. Although we have not done
any experiments on other languages, this way of us-
ing global features from a whole document should
be applicable to other languages.
2 Related Work
Recently, statistical NERs have achieved results
that are comparable to hand-coded systems. Since
MUC-6, BBN's Hidden Markov Model (HMM)
based IdentiFinder (Bikel et al., 1997) has achieved
remarkably good performance. MUC-7 has also
seen hybrids of statistical NERs and hand-coded
systems (Mikheev et al., 1998; Borthwick, 1999),
notably Mikheev's system, which achieved the best
performance of 93.39% on the official NE test data.
MENE (Maximum Entropy Named Entity) (Borth-
wick, 1999) was combined with Proteus (a hand-
coded system), and came in fourth among all MUC-
7 participants. MENE without Proteus, however,
did not do very well and only achieved an F-
measure of 84.22% (Borthwick, 1999).
Among machine learning-based NERs, Identi-
Finder has proven to be the best on the official
MUC-6 and MUC-7 test data. MENE (without
the help of hand-coded systems) has been shown
to be somewhat inferior in performance. By using
the output of a hand-coded system such as Proteus,
MENE can improve its performance, and can even
outperform IdentiFinder (Borthwick, 1999).
Mikheev et al. (1998) did make use of informa-
tion from the whole document. However, their sys-
tem is a hybrid of hand-coded rules and machine
learning methods. Another attempt at using global
information can be found in (Borthwick, 1999). He
used an additional maximum entropy classifier that
tries to correct mistakes by using reference resolu-
tion. Reference resolution involves finding words
that co-refer to the same entity. In order to train
this error-correction model, he divided his training
corpus into 5 portions of 20% each. MENE is then
trained on 80% of the training corpus, and tested
on the remaining 20%. This process is repeated 5
times by rotating the data appropriately. Finally,
the concatenated 5 * 20% output is used to train
the reference resolution component. We will show
that by giving the first model some global features,
MENERGI outperforms Borthwick's reference res-
olution classifier. On MUC-6 data, MENERGI also
achieves performance comparable to IdentiFinder
when trained on a similar amount of training data.
In Section 5, we try to compare results of
MENE, IdentiFinder, and MENERGI. However,
both MENE and IdentiFinder used more training
data than we did (we used only the official MUC-
6 and MUC-7 training data). On the MUC-6 data,
Bikel et al. (1997; 1999) do have some statistics that
show how IdentiFinder performs when the training
data is reduced. Our results show that MENERGI
performs as well as IdentiFinder when trained on
a comparable amount of training data.
3 System Description
The system described in this paper is similar to the
MENE system of (Borthwick, 1999). It uses a max-
imum entropy framework and classifies each word
given its features. Each name class N is subdivided
into 4 sub-classes, i.e., N_begin, N_continue, N_end,
and N_unique. Hence, there is a total of 29 classes (7
name classes × 4 sub-classes + 1 not-a-name class).
3.1 Maximum Entropy
The maximum entropy framework estimates prob-
abilities based on the principle of making as few
assumptions as possible, other than the constraints
imposed. Such constraints are derived from train-
ing data, expressing some relationship between fea-
tures and outcome. The probability distribution
that satisfies the above property is the one with
the highest entropy. It is unique, agrees with the
maximum-likelihood distribution, and has the expo-
nential form (Della Pietra et al., 1997):
p(o | h) = (1 / Z(h)) * prod_j alpha_j ^ f_j(h, o)

where o refers to the outcome, h the history (or context),
and Z(h) is a normalization function. In addition,
each feature function f_j(h, o) is a binary function.
For example, in predicting if a word belongs
to a word class, o is either true or false, and h refers
to the surrounding context:

f_j(h, o) = 1 if o = true, previous word = the
            0 otherwise
The parameters alpha_j are estimated by a procedure
called Generalized Iterative Scaling (GIS) (Darroch
called Generalized Iterative Scaling (GIS) (Darroch
and Ratcliff, 1972). This is an iterative method that
improves the estimation of the parameters at each it-
eration. We have used the Java-based opennlp max-
imum entropy package1.
1http://maxent.sourceforge.net
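The exponential form above can be illustrated with a small numerical sketch. The feature, its weight, and the two-outcome event space are invented for illustration; in practice the weights alpha_j would be estimated with GIS over the training data.

```python
def maxent_prob(history, outcome, features, alphas):
    """p(o | h) = (1/Z(h)) * prod_j alpha_j ** f_j(h, o), with binary
    feature functions f_j and weights alpha_j. Z(h) normalizes over
    the possible outcomes (here simply True/False)."""
    def unnormalized(o):
        s = 1.0
        for f, alpha in zip(features, alphas):
            s *= alpha ** f(history, o)
        return s
    z = sum(unnormalized(o) for o in (True, False))
    return unnormalized(outcome) / z

# Hypothetical feature: fires when the outcome is true
# and the previous word is "the" (as in the example above).
f0 = lambda h, o: 1 if (o is True and h.get("prev") == "the") else 0
```

With weight 3.0 this feature pushes the probability of the true outcome up to 3/(3+1) when the previous word is "the", and leaves it at 1/2 otherwise.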
3.2 Testing
During testing, it is possible that the classifier
produces a sequence of inadmissible classes (e.g.,
person_begin followed by location_unique). To
eliminate such sequences, we define a transition
probability between word classes P(c_i | c_{i-1}) to be
equal to 1 if the sequence is admissible, and 0
otherwise. The probability of the classes c_1, ..., c_n
assigned to the words in a sentence s in a document
D is defined as follows:

P(c_1, ..., c_n | s, D) = prod_{i=1..n} P(c_i | s, D) * P(c_i | c_{i-1})

where P(c_i | s, D) is determined by the maximum
entropy classifier. A dynamic programming algo-
rithm is then used to select the sequence of word
classes with the highest probability.
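The dynamic programming step can be sketched as a small Viterbi pass; the class names and probabilities below are invented for illustration, and the 0/1 transition term appears as the `admissible` predicate.

```python
def best_sequence(class_probs, admissible):
    """Viterbi search for the most probable class sequence, where the
    transition term is 1 for admissible class pairs and 0 otherwise.
    class_probs: one dict {cls: P(cls | word, doc)} per word, as output
    by the maximum entropy classifier; admissible(prev, cls) -> bool."""
    scores = dict(class_probs[0])      # best path probability ending in cls
    backptrs = []
    for probs in class_probs[1:]:
        new_scores, ptr = {}, {}
        for c, p in probs.items():
            cands = [(scores[pc] * p, pc) for pc in scores if admissible(pc, c)]
            new_scores[c], ptr[c] = max(cands) if cands else (0.0, None)
        scores = new_scores
        backptrs.append(ptr)
    c = max(scores, key=scores.get)    # best final class
    seq = [c]
    for ptr in reversed(backptrs):     # trace back pointers
        c = ptr[c]
        seq.append(c)
    return seq[::-1]
```

Inadmissible sequences such as person_begin followed by location_unique simply never enter the candidate list, so they can never be selected.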
4 Feature Description
The features we used can be divided into 2 classes:
local and global. Local features are features that are
based on neighboring tokens, as well as the token
itself. Global features are extracted from other oc-
currences of the same token in the whole document.
The local features used are similar to those used
in BBN's IdentiFinder (Bikel et al., 1999) or MENE
(Borthwick, 1999). However, to classify a token
w, while Borthwick uses tokens from w-2 to w+2
(from two tokens before to two tokens after w),
we used only the tokens w-1, w, and w+1. Even
with local features alone, MENERGI outperforms
MENE (Borthwick, 1999). This might be because
our features are more comprehensive than those
used by Borthwick. In IdentiFinder, there is a prior-
ity in the feature assignment, such that if one feature
is used for a token, another feature lower in priority
will not be used. In the maximum entropy frame-
work, there is no such constraint. Multiple features
can be used for the same token.
Feature selection is implemented using a feature
cutoff: features seen less than a small count dur-
ing training will not be used. We group the features
used into feature groups. Each feature group can be
made up of many binary features. For each token
w, zero, one, or more of the features in each feature
group are set to 1.
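The feature cutoff itself amounts to a simple counting step. The threshold value and the feature-name encoding below are illustrative assumptions; the paper only states that features seen less than a small count during training are dropped.

```python
from collections import Counter

def select_features(feature_events, cutoff=3):
    """Keep only features seen at least `cutoff` times during training.
    feature_events: iterable of feature names, one entry per firing
    of that feature on some training token."""
    counts = Counter(feature_events)
    return {f for f, n in counts.items() if n >= cutoff}
```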
4.1 Local Features
The local feature groups are:
Non-Contextual Feature: This feature is set to
1 for all tokens. This feature imposes constraints
that are based on the probability of each name class
during training.

Token satisfies                  Example    Feature
Starts with a capital letter,    Mr.        InitCapPeriod
  ends with a period
Contains only one capital        A          OneCap
  letter
All capital letters and period   CORP.      AllCapsPeriod
Contains a digit                 AB3, 747   ContainDigit
Made up of 2 digits              99         TwoD
Made up of 4 digits              1999       FourD
Made up of digits and slash      01/01      DigitSlash
Contains a dollar sign           US$20      Dollar
Contains a percent sign          20%        Percent
Contains digit and period        $US3.20    DigitPeriod

Table 1: Features based on the token string
Zone: MUC data contains SGML tags, and a
document is divided into zones (e.g., headlines and
text zones). The zone to which a token belongs is
used as a feature. For example, in MUC-6, there are
four zones (TXT, HL, DATELINE, DD). Hence, for
each token, one of the four features zone-TXT, zone-
HL, zone-DATELINE, or zone-DD is set to 1, and
the other 3 are set to 0.
Case and Zone: If the token w starts with a capital
letter (initCaps), then an additional feature (initCaps,
zone) is set to 1. If it is made up of all capital
letters, then (allCaps, zone) is set to 1. If it starts
with a lower case letter, and contains both upper
and lower case letters, then (mixedCaps, zone) is set
to 1. A token that is allCaps will also be initCaps.
This group consists of (3 × total number of possible
zones) features.
Case and Zone of w+1 and w-1: Similarly,
if w+1 (or w-1) is initCaps, a feature (initCaps,
zone)_NEXT (or (initCaps, zone)_PREV) is set to 1,
etc.
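A rough sketch of how this feature group can be computed follows; the exact case tests are assumptions based on the description above (note that an allCaps token also fires initCaps, as stated).

```python
def case_zone_features(token, zone):
    """Binary (case, zone) features for one token. An allCaps token is
    also initCaps, so both features can fire for the same token."""
    feats = set()
    if token[:1].isupper():                    # starts with a capital letter
        feats.add(("initCaps", zone))
    if token.isalpha() and token.isupper():    # made up of all capital letters
        feats.add(("allCaps", zone))
    if token[:1].islower() and any(ch.isupper() for ch in token):
        feats.add(("mixedCaps", zone))         # e.g. eBay
    return feats
```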
Token Information: This group consists of 10
features based on the string of w, as listed in Table 1.
For example, if a token starts with a capital letter
and ends with a period (such as Mr.), then the feature
InitCapPeriod is set to 1, etc.
First Word: This feature group contains only one
feature firstword. If the token is the first word of a
sentence, then this feature is set to 1. Otherwise, it
is set to 0.
Lexicon Feature: The string of the token w is
used as a feature. This group contains a large number
of features (one for each token string present in
the training data). At most one feature in this group
will be set to 1. If w is seen infrequently during
training (less than a small count), then w will not be
selected as a feature and all features in this group
are set to 0.
Lexicon Feature of Previous and Next Token:
The string of the previous token w-1 and the next
token w+1 is used with the initCaps information
of w. If w has initCaps, then a feature (initCaps,
w+1)_NEXT is set to 1. If w is not initCaps, then
(not-initCaps, w+1)_NEXT is set to 1. Same for
w-1. In the case where the next token w+1 is a
hyphen, then w+2 is also used as a feature: (initCaps,
w+2)_NEXT is set to 1. This is because in
many cases, the use of hyphens can be considered
to be optional (e.g., third-quarter or third quarter).
Out-of-Vocabulary: We derived a lexicon list
from WordNet 1.6, and words that are not found in
this list have a feature out-of-vocabulary set to 1.
Dictionaries: Due to the limited amount of train-
ing material, name dictionaries have been found to
be useful in the named entity task. The importance
of dictionaries in NERs has been investigated in the
literature (Mikheev et al., 1999). The sources of our
dictionaries are listed in Table 2. For all lists ex-
cept locations, the lists are processed into a list of
tokens (unigrams). The location list is processed into
a list of unigrams and bigrams (e.g., New York). For
locations, tokens are matched against unigrams, and
sequences of two consecutive tokens are matched
against bigrams. A list of words occurring more
than 10 times in the training data is also collected
(commonWords). Only tokens with initCaps not
found in commonWords are tested against each list
in Table 2. If they are found in a list, then a feature
for that list will be set to 1. For example, if Barry is
not in commonWords and is found in the list of per-
son first names, then the feature PersonFirstName
will be set to 1. Similarly, the tokens w+1 and w-1
are tested against each list, and if found, a corresponding
feature will be set to 1. For example, if
w+1 is found in the list of person first names, the
feature PersonFirstName_NEXT is set to 1.
Month Names, Days of the Week, and Numbers:
If w is initCaps and is one of January, February,
. . . , December, then the feature MonthName is
set to 1. If w is one of Monday, Tuesday, . . . , Sunday,
then the feature DayOfTheWeek is set to 1. If w
is a number string (such as one, two, etc.), then the
feature NumberString is set to 1.
Suffixes and Prefixes: This group contains only
two features: Corporate-Suffix and Person-Prefix.
Two lists, Corporate-Suffix-List (for corporate suf-
fixes) and Person-Prefix-List (for person prefixes),
are collected from the training data. For corporate
suffixes, a list of tokens cslist that occur frequently
as the last token of an organization name is col-
lected from the training data. Frequency is calcu-
lated by counting the number of distinct previous
tokens that each token has (e.g., if Electric Corp. is
seen 3 times, and Manufacturing Corp. is seen 5
times during training, and Corp. is not seen with
any other preceding tokens, then the “frequency”
of Corp. is 2). The most frequently occurring last
words of organization names in cslist are compiled
into a list of corporate suffixes, Corporate-Suffix-
List. A Person-Prefix-List is compiled in an analogous
way. For MUC-6, for example, Corporate-Suffix-List
is made up of {ltd., associates, inc., co,
corp, ltd, inc, committee, institute, commission, university,
plc, airlines, co., corp.} and Person-Prefix-List
is made up of {succeeding, mr., rep., mrs., secretary,
sen., says, minister, dr., chairman, ms.}. For
a token w that is in a consecutive sequence of initCaps
tokens (w-m, ..., w, ..., w+n), if any of the
tokens from w+1 to w+n is in Corporate-Suffix-List,
then a feature Corporate-Suffix is set to 1. If any of
the tokens from w-m-1 to w-1 is in Person-Prefix-List,
then another feature Person-Prefix is set to 1.
Note that we check for w-m-1, the word preceding
the consecutive sequence of initCaps tokens, since
person prefixes like Mr., Dr., etc. are not part of person
names, whereas corporate suffixes like Corp.,
Inc., etc. are part of corporate names.
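The "frequency" computation described above, counting distinct preceding tokens rather than raw occurrences, can be sketched as follows; the function name is our own.

```python
from collections import defaultdict

def suffix_frequencies(org_names):
    """'Frequency' of a candidate corporate suffix = number of distinct
    tokens that precede it as the last word of an organization name.
    E.g. 'Corp.' preceded only by 'Electric' and 'Manufacturing'
    scores 2, regardless of how often each full name occurs."""
    preceding = defaultdict(set)
    for name in org_names:
        tokens = name.split()
        if len(tokens) >= 2:
            preceding[tokens[-1]].add(tokens[-2])
    return {suffix: len(prevs) for suffix, prevs in preceding.items()}
```

This matches the example in the text: Electric Corp. seen 3 times and Manufacturing Corp. seen 5 times still give Corp. a "frequency" of 2.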
4.2 Global Features
Context from the whole document can be impor-
tant in classifying a named entity. A name already
mentioned previously in a document may appear in
abbreviated form when it is mentioned again later.
Previous work deals with this problem by correcting
inconsistencies between the named entity classes
assigned to different occurrences of the same entity
(Borthwick, 1999; Mikheev et al., 1998). We of-
ten encounter sentences that are highly ambiguous
in themselves, without some prior knowledge of the
entities concerned. For example:
McCann initiated a new global system. (1)
CEO of McCann . . . (2)
Description          Source
Location Names       http://www.timeanddate.com
                     http://www.cityguide.travel-guides.com
                     http://www.worldtravelguide.net
Corporate Names      http://www.fmlx.com
Person First Names   http://www.census.gov/genealogy/names
Person Last Names

Table 2: Sources of Dictionaries
The McCann family . . . (3)
In sentence (1), McCann can be a person or an orga-
nization. Sentences (2) and (3) help to disambiguate
one way or the other. If all three sentences are in the
same document, then even a human will find it dif-
ficult to classify McCann in (1) into either person or
organization, unless there is some other information
provided.
The global feature groups are:
InitCaps of Other Occurrences (ICOC): There
are 2 features in this group, checking for whether
the first occurrence of the same word in an un-
ambiguous position (non first-words in the TXT or
TEXT zones) in the same document is initCaps or
not-initCaps. For a word whose initCaps might be
due to its position rather than its meaning (in head-
lines, first word of a sentence, etc), the case infor-
mation of other occurrences might be more accurate
than its own. For example, in the sentence that starts
with “Bush put a freeze on . . . ”, because Bush is
the first word, the initial caps might be due to its
position (as in “They put a freeze on . . . ”). If some-
where else in the document we see “restrictions put
in place by President Bush”, then we can be surer
that Bush is a name.
Corporate Suffixes and Person Prefixes of
Other Occurrences (CSPP): If McCann has been
seen as Mr. McCann somewhere else in the docu-
ment, then one would like to give person a higher
probability than organization. On the other hand,
if it is seen as McCann Pte. Ltd., then organization
will be more probable. With the same Corporate-
Suffix-List and Person-Prefix-List used in local fea-
tures, for a token w seen elsewhere in the same
document with one of these suffixes (or prefixes),
another feature Other-CS (or Other-PP) is set to 1.
Acronyms (ACRO): Words made up of all cap-
italized letters in the text zone will be stored as
acronyms (e.g., IBM). The system will then look
for sequences of initial capitalized words that match
the acronyms found in the whole document. Such
sequences are given additional features of A_begin,
A_continue, or A_end, and the acronym is given a
feature A_unique. For example, if FCC and Federal
Communications Commission are both found in a
document, then Federal has A_begin set to 1,
Communications has A_continue set to 1, Commission
has A_end set to 1, and FCC has A_unique set to 1.
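The acronym-to-sequence matching can be sketched roughly as follows; the helper name and the first-letter matching rule are assumptions based on the FCC example (the acronym token itself would get A_unique separately).

```python
def match_acronym(acronym, tokens):
    """Find a run of initial-capitalized tokens whose first letters
    spell `acronym`, and label it A_begin / A_continue / A_end.
    Returns {token_index: label}, or {} if no run matches."""
    n = len(acronym)
    for i in range(len(tokens) - n + 1):
        span = tokens[i:i + n]
        if all(t[:1].isupper() for t in span) and \
                "".join(t[0] for t in span).upper() == acronym.upper():
            labels = {i: "A_begin", i + n - 1: "A_end"}
            for j in range(i + 1, i + n - 1):
                labels[j] = "A_continue"
            return labels
    return {}
```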
Sequence of Initial Caps (SOIC): In the sen-
tence Even News Broadcasting Corp., noted for its
accurate reporting, made the erroneous announce-
ment., a NER may mistake Even News Broadcast-
ing Corp. as an organization name. However, it
is unlikely that other occurrences of News Broad-
casting Corp. in the same document also co-occur
with Even. This group of features attempts to cap-
ture such information. For every sequence of initial
capitalized words, its longest substring that occurs
in the same document as a sequence of initCaps
is identified. For this example, since the sequence
Even News Broadcasting Corp. only appears once
in the document, its longest substring that occurs in
the same document is News Broadcasting Corp. In
this case, News has an additional feature of I_begin
set to 1, Broadcasting has an additional feature of
I_continue set to 1, and Corp. has an additional
feature of I_end set to 1.
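A brute-force sketch of the longest-substring step, under the assumption that initCaps sequences are available as token lists (quadratic enumeration is fine at sentence length):

```python
def longest_shared_run(seq, other_seqs):
    """Longest contiguous sub-run of the initCaps sequence `seq` that
    also occurs as a contiguous run in another initCaps sequence of
    the same document."""
    runs_elsewhere = set()
    for s in other_seqs:                       # enumerate all runs elsewhere
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                runs_elsewhere.add(tuple(s[i:j]))
    best = ()
    for i in range(len(seq)):                  # longest run of seq seen elsewhere
        for j in range(i + 1, len(seq) + 1):
            cand = tuple(seq[i:j])
            if cand in runs_elsewhere and len(cand) > len(best):
                best = cand
    return list(best)
```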
Unique Occurrences and Zone (UNIQ): This
group of features indicates whether the word w is
unique in the whole document. w needs to be in
initCaps to be considered for this feature. If w is
unique, then a feature (Unique, Zone) is set to 1,
where Zone is the document zone where w appears.
As we will see from Table 3, not much improvement
is derived from this feature.
5 Experimental Results
The baseline system in Table 3 refers to the maxi-
mum entropy system that uses only local features.
As each global feature group is added to the list of
features, we see improvements to both MUC-6 and
           MUC-6     MUC-7
Baseline   90.75%    85.22%
+ ICOC     91.50%    86.24%
+ CSPP     92.89%    86.96%
+ ACRO     93.04%    86.99%
+ SOIC     93.25%    87.22%
+ UNIQ     93.27%    87.24%

Table 3: F-measure after successive addition of each
global feature group
               MUC-6                 MUC-7
Systems        No. of    No. of     No. of    No. of
               Articles  Tokens     Articles  Tokens
MENERGI        318       160,000    200       180,000
IdentiFinder   –         650,000    –         790,000
MENE           –         –          350       321,000

Table 4: Training Data
MUC-7 test accuracy.2 For MUC-6, the reduction in
error due to global features is 27%, and for MUC-7,
14%. ICOC and CSPP contributed the greatest im-
provements. The effect of UNIQ is very small on
both data sets.
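The error-reduction figures quoted above follow from treating (100 - F) as the error rate:

```python
def error_reduction(baseline_f, improved_f):
    """Relative reduction in error when F-measure rises from
    baseline_f to improved_f, with error taken as (100 - F)."""
    return (improved_f - baseline_f) / (100.0 - baseline_f)

muc6 = error_reduction(90.75, 93.27)   # about 0.27 for MUC-6
muc7 = error_reduction(85.22, 87.24)   # about 0.14 for MUC-7
```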
All our results are obtained by using only the of-
ficial training data provided by the MUC confer-
ences. The reason we did not train with both
MUC-6 and MUC-7 training data at the same time
is that the task specifications for the two tasks
are not identical. As can be seen in Table 4, our
training data is much less than that used by MENE
and IdentiFinder3. In this section, we compare
our results with those obtained by IdentiFinder
'97 (Bikel et al., 1997), IdentiFinder '99 (Bikel et
al., 1999), and MENE (Borthwick, 1999). Iden-
tiFinder '99's results are considerably better than
IdentiFinder '97's. IdentiFinder's performance in
MUC-7 is published in (Miller et al., 1998). MENE
has only been tested on MUC-7.
For fair comparison, we have tabulated all results
with the size of training data used (Table 5 and Ta-
ble 6). Besides size of training data, the use of
dictionaries is another factor that might affect per-
formance. Bikel et al. (1999) did not report using
any dictionaries, but mentioned in a footnote that
they have added list membership features, which
have helped marginally in certain domains. Borth-
2MUC data can be obtained from the Linguistic Data Con-
sortium: http://www.ldc.upenn.edu
3Training data for IdentiFinder is actually given in words
(i.e., 650K & 790K words), rather than tokens
Systems            Size of training data    F-measure
SRA '95            Hand-coded               96.4%
IdentiFinder '99   650,000 words            94.9%
MENERGI            160,000 tokens           93.27%
IdentiFinder '99   about 200,000 words      About 93% (from graph)
IdentiFinder '97   450,000 words            93%
IdentiFinder '97   about 100,000 words      91%-92%

Table 5: Comparison of results for MUC-6
Systems                 Size of training data    F-measure
LTG system '98          Hybrid hand-coded        93.39%
IdentiFinder '98        790,000 words            90.44%
MENE + Proteus '98      Hybrid hand-coded,       88.80%
                        321,000 tokens
MENERGI                 180,000 tokens           87.24%
MENE + reference-       321,000 tokens           86.56%
  resolution '99
MENE '98                321,000 tokens           84.22%

Table 6: Comparison of results for MUC-7
wick (1999) reported using dictionaries of person
first names, corporate names and suffixes, colleges
and universities, dates and times, state abbrevia-
tions, and world regions.
In MUC-6, the best result is achieved by SRA
(Krupka, 1995). In (Bikel et al., 1997) and (Bikel et
al., 1999), performance was plotted against training
data size to show how performance improves with
training data size. We have estimated the perfor-
mance of IdentiFinder '99 at 200K words of training
data from the graphs.
For MUC-7, there are also no published results
on systems trained on only the official training data
of 200 aviation disaster articles. In fact, training on
the official training data is not suitable as the articles
in this data set are entirely about aviation disasters,
and the test data is about air vehicle launching. Both
BBN and NYU have tagged their own data to sup-
plement the official training data. Even with less
training data, MENERGI outperforms Borthwick's
MENE + reference resolution (Borthwick, 1999).
Except for our own results and those of MENE +
reference resolution, the results in Table 6 are all
official MUC-7 results.
The effect of a second reference resolution clas-
sifier is not entirely the same as that of global fea-
tures. A secondary reference resolution classifier
has information on the class assigned by the pri-
mary classifier. Such a classification can be seen
as a not-always-correct summary of global features.
The secondary classifier in (Borthwick, 1999) uses
information not just from the current article, but also
from the whole test corpus, with an additional fea-
ture that indicates if the information comes from the
same document or from another document. We feel
that information from a whole corpus might turn
out to be noisy if the documents in the corpus are
not of the same genre. Moreover, if we want to
test on a huge test corpus, indexing the whole cor-
pus might prove computationally expensive. Hence
we decided to restrict ourselves to only information
from the same document.
Mikheev et al. (1998) have also used a maximum
entropy classifier that uses already tagged entities
to help tag other entities. The overall performance
of the LTG system was outstanding, but the system
consists of a sequence of many hand-coded rules
and machine-learning modules.
6 Conclusion
We have shown that the maximum entropy frame-
work is able to use global information directly. This
enables us to build a high performance NER with-
out using separate classifiers to take care of global
consistency or complex formulation on smoothing
and backoff models (Bikel et al., 1997). Using less
training data than other systems, our NER is able
to perform as well as other state-of-the-art NERs.
Information from a sentence is sometimes insuffi-
cient to classify a name correctly. Global context
from the whole document is available and can be ex-
ploited in a natural manner with a maximum entropy
classifier. We believe that the underlying principles
of the maximum entropy framework are suitable for
exploiting information from diverse sources. Borth-
wick (1999) successfully made use of other hand-
coded systems as input for his MENE system, and
achieved excellent results. However, such an ap-
proach requires a number of hand-coded systems,
which may not be available in languages other than
English. We believe that global context is useful in
most languages, as it is a natural tendency for
authors to use abbreviations for entities already
mentioned.

References

Daniel M. Bikel, Scott Miller, Richard Schwartz,
and Ralph Weischedel. 1997. Nymble: A high-
performance learning name-finder. In Proceed-
ings of the Fifth Conference on Applied Natural
Language Processing, pages 194–201.

Daniel M. Bikel, Richard Schwartz, and Ralph M.
Weischedel. 1999. An algorithm that learns
what's in a name. Machine Learning,
34(1/2/3):211–231.

Andrew Borthwick. 1999. A Maximum Entropy
Approach to Named Entity Recognition. Ph.D.
thesis, Computer Science Department, New York
University.

Nancy Chinchor. 1998. MUC-7 named entity task
definition, version 3.5. In Proceedings of the Sev-
enth Message Understanding Conference.

J. N. Darroch and D. Ratcliff. 1972. Generalized
iterative scaling for log-linear models. Annals of
Mathematical Statistics, 43(5):1470–1480.

Stephen Della Pietra, Vincent Della Pietra, and
John Lafferty. 1997. Inducing features of ran-
dom fields. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence, 19(4):380–393.

George R. Krupka. 1995. SRA: Description of the
SRA system as used for MUC-6. In Proceedings
of the Sixth Message Understanding Conference,
pages 221–235.

Andrei Mikheev, Claire Grover, and Marc Moens.
1998. Description of the LTG system used for
MUC-7. In Proceedings of the Seventh Message
Understanding Conference.

Andrei Mikheev, Marc Moens, and Claire Grover.
1999. Named entity recognition without
gazetteers. In Proceedings of the Ninth Confer-
ence of the European Chapter of the Association
for Computational Linguistics, pages 1–8.

Scott Miller, Michael Crystal, Heidi Fox, Lance
Ramshaw, Richard Schwartz, Rebecca Stone,
Ralph Weischedel, and the Annotation Group.
1998. Algorithms that learn to extract informa-
tion BBN: Description of the SIFT system as
used for MUC-7. In Proceedings of the Seventh
Message Understanding Conference.

Beth M. Sundheim. 1995. Named entity task def-
inition, version 2.1. In Proceedings of the Sixth
Message Understanding Conference, pages 319–
332.
