A Statistical Model for Multilingual Entity Detection and Tracking
R. Florian, H. Hassana0 , A. Ittycheriah, H. Jing
N. Kambhatla, X. Luo, N. Nicolov, and S. Roukos
I.B.M. T.J. Watson Research Center
Yorktown Heights, NY 10598
{raduf,abei,hjing,nanda,xiaoluo, nicolas,roukos}@us.ibm.com
a1 hanyh@eg.ibm.com
Abstract
Entity detection and tracking is a relatively new
addition to the repertoire of natural language
tasks. In this paper, we present a statistical
language-independent framework for identify-
ing and tracking named, nominal and pronom-
inal references to entities within unrestricted
text documents, and chaining them into clusters
corresponding to each logical entity present in
the text. Both the mention detection model
and the novel entity tracking model can use
arbitrary feature types, being able to integrate
a wide array of lexical, syntactic and seman-
tic features. In addition, the mention detec-
tion model crucially uses feature streams de-
rived from different named entity classifiers.
The proposed framework is evaluated with sev-
eral experiments run in Arabic, Chinese and
English texts; a system based on the approach
described here and submitted to the latest Au-
tomatic Content Extraction (ACE) evaluation
achieved top-tier results in all three evaluation
languages.
1 Introduction
Detecting entities, whether named, nominal or pronom-
inal, in unrestricted text is a crucial step toward under-
standing the text, as it identifies the important concep-
tual objects in a discourse. It is also a necessary step for
identifying the relations present in the text and populating
a knowledge database. This task has applications in in-
formation extraction and summarization, information re-
trieval (one can get all hits for Washington/person and not
the ones for Washington/state or Washington/city), data
mining and question answering.
The Entity Detection and Tracking task (EDT hence-
forth) has close ties to the named entity recognition
(NER) and coreference resolution tasks, which have been
the focus of attention of much investigation in the recent
past (Bikel et al., 1997; Borthwick et al., 1998; Mikheev
et al., 1999; Miller et al., 1998; Aberdeen et al., 1995;
Ng and Cardie, 2002; Soon et al., 2001), and have been
at the center of several evaluations: MUC-6, MUC-7,
CoNLL’02 and CoNLL’03 shared tasks. Usually, in com-
putational linguistic literature, a named entity represents
an instance of a name, either a location, a person, an or-
ganization, and the NER task consists of identifying each
individual occurrence of such an entity. We will instead
adopt the nomenclature of the Automatic Content Extrac-
tion program1 (NIST, 2003a): we will call the instances
of textual references to objects or abstractions mentions,
which can be either named (e.g. John Mayor), nominal
(e.g. the president) or pronominal (e.g. she, it). An entity
consists of all the mentions (of any level) which refer to
one conceptual entity. For instance, in the sentence
President John Smith said he has no comments.
there are two mentions: John Smith and he (in the order
of appearance, their levels are named and pronominal),
but one entity, formed by the set {John Smith, he}.
In this paper, we present a general statistical frame-
work for entity detection and tracking in unrestricted text.
The framework is not language specific, as proved by ap-
plying it to three radically different languages: Arabic,
Chinese and English. We separate the EDT task into a
mention detection part – the task of finding all mentions
in the text – and an entity tracking part – the task of com-
bining the detected mentions into groups of references to
the same object.
The work presented here is motivated by the ACE eval-
uation framework, which has the more general goal of
building multilingual systems which detect not only enti-
ties, but also relations among them and, more recently,
events in which they participate. The EDT task is ar-
guably harder than traditional named entity recognition,
because of the additional complexity involved in extract-
ing non-named mentions (nominals and pronouns) and
the requirement of grouping mentions into entities.
We present and evaluate empirically statistical mod-
els for both mention detection and entity tracking prob-
lems. For mention detection we use approaches based on
Maximum Entropy (MaxEnt henceforth) (Berger et al.,
1996) and Robust Risk Minimization (RRM henceforth)
1For a description of the ACE program see
http://www.nist.gov/speech/tests/ace/.
(Zhang et al., 2002). The task is transformed into a se-
quence classification problem. We investigate a wide ar-
ray of lexical, syntactic and semantic features to perform
the mention detection and classification task including,
for all three languages, features based on pre-existing sta-
tistical semantic taggers, even though these taggers have
been trained on different corpora and use different seman-
tic categories. Moreover, the presented approach implic-
itly learns the correlation between these different seman-
tic types and the desired output types.
We propose a novel MaxEnt-based model for predict-
ing whether a mention should or should not be linked to
an existing entity, and show how this model can be used
to build entity chains. The effectiveness of the approach
is tested by applying it on data from the above mentioned
languages — Arabic, Chinese, English.
The framework presented in this paper is language-
universal – the classification method does not make any
assumption about the type of input. Most of the fea-
ture types are shared across the languages, but there are a
small number of useful feature types which are language-
specific, especially for the mention detection task.
The paper is organized as follows: Section 2 describes
the algorithms and feature types used for mention detec-
tion. Section 3 presents our approach to entity tracking.
Section 4 describes the experimental framework and the
systems’ results for Arabic, Chinese and English on the
data from the latest ACE evaluation (September 2003), an
investigation of the effect of using different feature types,
as well as a discussion of the results.
2 Mention Detection
The mention detection system identifies the named, nom-
inal and pronominal mentions introduced in the previous
section. Similarly to classical NLP tasks such as base
noun phrase chunking (Ramshaw and Marcus, 1994), text
chunking (Ramshaw and Marcus, 1995) or named entity
recognition (Tjong Kim Sang, 2002), we formulate the
mention detection problem as a classification problem,
by assigning to each token in the text a label, indicating
whether it starts a specific mention, is inside a specific
mention, or is outside any mentions.
2.1 The Statistical Classifiers
Good performance in many natural language process-
ing tasks, such as part-of-speech tagging, shallow pars-
ing and named entity recognition, has been shown to de-
pend heavily on integrating many sources of information
(Zhang et al., 2002; Jing et al., 2003; Ittycheriah et al.,
2003). Given the stated focus of integrating many feature
types, we are interested in algorithms that can easily in-
tegrate and make effective use of diverse input types. We
selected two methods which satisfy these criteria: a linear
classifier – the Robust Risk Minimization classifier – and
a log-linear classifier – the Maximum Entropy classifier.
Both methods can integrate arbitrary types of informa-
tion and make a classification decision by aggregating all
information available for a given classification.
Before formally describing the methods2, we introduce
some notations: let a0a2a1a4a3a6a5a6a7a9a8a11a10a12a10a11a10a13a8a14a5a13a15a17a16 be the set of pre-
dicted classes, a18 be the example space anda19 a1a20a3a6a21a22a8a12a23a24a16a6a25
be the feature space. Each examplea26a28a27a29a18 has associated
a vector of binary features a30a32a31a33a26a35a34 a1 a31a36a30 a7 a31a33a26a17a34 a8a11a10a11a10a12a10a13a8a30 a25 a31a37a26a35a34a38a34 .
We also assume the existence of a training data set a39a41a40
a18 and a test set a42a43a40a44a18 .
The RRM algorithm (Zhang et al., 2002) constructs a45
linear classifiers a31a47a46a49a48a36a34
a48a51a50
a7a53a52a54a52a54a52a15 (one for each predicted class),
each predicting whether the current example belongs to
the class or not. Every such classifier a46a55a48 has an associ-
ated feature weight vector, a31a37a56 a48a54a57 a34
a57a38a50
a7a58a52a54a52a54a52
a25
, which is learned
during the training phase so as to minimize the classifica-
tion error rate3. At test time, for each examplea26a29a27a59a42 , the
model computes a score
a60
a48a61a31a37a26a35a34
a1
a25
a62
a57a38a50
a7
a56a55a48a63a57a65a64a12a30a12a57a66a31a37a26a35a34
and labels the example with either the class correspond-
ing to the classifier with the highest score, if above
0, or outside, otherwise. The full decoding algorithm
is presented in Algorithm 1. This algorithm can also
be used for sequence classification (Williams and Peng,
1990), by converting the activation scores into probabili-
ties (through the soft-max function, for instance) and us-
ing the standard dynamic programing search algorithm
(also known as Viterbi search).
Algorithm 1 The RRM Decoding Algorithm
foreacha26a29a27a67a42
foreacha68 a1a20a23a69a10a11a10a12a10a45
a60
a48a22a70a26a22a71
a1a73a72a2a25
a57a38a50
a7
a56 a48a54a57 a64a74a30 a57 a31a33a26a17a34
a5a11a75
a60a77a76a74a76
a31a78a26a35a34a66a79a81a80a83a82a38a84a69a85a86a80a24a87a88a48
a60
a48 a70a26a61a71
Somewhat similarly, the MaxEnt algorithm has an as-
sociated set of weights a31a33a89 a48a54a57 a34a48a90a50
a7a53a52a54a52a54a52a15
a57a38a50
a7a58a52a54a52a54a52
a25
, which are estimated
during the training phase so as to maximize the likelihood
of the data (Berger et al., 1996). Given these weights, the
model computes the probability distribution of a particu-
lar examplea26 as follows:
a91
a31
a5
a48a14a92a26a17a34
a1
a23
a93
a25
a94
a57a38a50
a7
a89a96a95a98a97a13a99a51a100a12a101
a48a54a57
a8 a93 a1
a62
a48
a94
a57
a89a96a95a98a97a11a99a90a100a6a101
a48a63a57
where a93 is a normalization factor.
After computing the class probability distribution, the
assigned class is the most probable one a posteriori. The
sketch of applying MaxEnt to the test data is presented
in Algorithm 2. Similarly to the RRM model, we use
the model to perform sequence classification, through dy-
namic programing.
2This is not meant to be an in-depth introduction to the meth-
ods, but a brief overview to familiarize the reader with them.
3Actually, the optimizing function contains a regularization
factor which considerably improves the robustness of the sys-
tem — for full details, see Zhang et al. (2002).
Algorithm 2 The MaxEnt Decoding Algorithm
foreacha26a29a27a67a42
a93
a79
a21
foreacha68 a1 a23a69a10a12a10a11a10a45
a1
a48a70a26a22a71
a1
a25
a2
a57a38a50
a7 a89 a95a97 a99a90a100a6a101
a48a63a57
Normalize (p)
a5a13a75
a60a77a76a74a76
a31a37a26a17a34a69a79 a80a24a82a14a84a69a85a86a80a9a87a77a48
a1
a48a70a26a61a71
Within this framework, any type of feature can be used,
enabling the system designer to experiment with interest-
ing feature types, rather than worry about specific feature
interactions. In contrast, in a rule based system, the sys-
tem designer would have to consider how, for instance,
a WordNet (Miller, 1995) derived information for a par-
ticular example interacts with a part-of-speech-based in-
formation and chunking information. That is not to say,
ultimately, that rule-based systems are in some way infe-
rior to statistical models – they are built using valuable
insight which is hard to obtain from a statistical-model-
only approach. Instead, we are just suggesting that the
output of such a system can be easily integrated into the
previously described framework, as one of the input fea-
tures, most likely leading to improved performance.
2.2 The Combination Hypothesis
In addition to using rich lexical, syntactic, and semantic
features, we leveraged several pre-existing mention tag-
gers. These pre-existing taggers were trained on datasets
outside of ACE training data and they identify types of
mentions different from the ACE types of mentions . For
instance, a pre-existing tagger may identify dates or oc-
cupation mentions (not used in ACE), among other types.
It could also have a class called PERSON, but the anno-
tation guideline of what represents a PERSON may not
match exactly to the notion of the PERSON type in ACE.
Our hypothesis – the combination hypothesis – is that
combining pre-existing classifiers from diverse sources
will boost performance by injecting complementary in-
formation into the mention detection models. Hence, we
used the output of these pre-existing taggers and used
them as additional feature streams for the mention de-
tection models. This approach allows the system to au-
tomatically correlate the (different) mention types to the
desired output.
2.3 Language-Independent Features
Even if the three languages (Arabic, Chinese and English)
are radically different syntacticly, semantically, and even
graphically, all models use a few universal types of fea-
tures, while others are language-specific. Let us note
again that, while some types of features only apply to
one language, the models have the same basic structure,
treating the problem as an abstract classification task.
The following is a list of the features that are shared
across languages (a56 a48 is considered by default the current
token):
a3 tokens4 in a window of
a4 :
a3
a56a55a48a6a5a8a7
a8a11a10a12a10a11a10a13a8
a56a49a48a10a9a11a7
a16 ;
a3 the part-of-speech associated with token
a56 a48
a3 dictionary information (whether the current token
is part of a large collection of dictionaries - one
boolean value for each dictionary)
a3 the output of named mention detectors trained on
different style of entities.
a3 the previously assigned classification tags5.
The following sections describe in detail the language-
specific features, and Table 1 summarizes the feature
types used in building the models in the three languages.
Finally, the experiments in Section 4 detail the perfor-
mance obtained by using selected combinations of fea-
ture subsets.
2.4 Arabic Mention Detection
Arabic, a highly inflected language, has linguistic pecu-
liarities that affect any mention detection system. An im-
portant aspect that needs to be addressed is segmentation:
which style should be used, how to deal with the inher-
ent segmentation ambiguity of mention names, especially
persons and locations, and, finally, how to handle the at-
tachment of pronouns to stems. Arabic blank-delimited
words are composed of zero or more prefixes, followed
by a stem and zero or more suffixes. Each prefix, stem or
suffix will be called a token in this discussion; any con-
tiguous sequence of tokens can represent a mention.
For example, the word “trwmAn” (translation: “Tru-
man”) could be segmented in 3 tokens (for instance, if
the word was not seen in the training data):
trwmAna12 ta13 rwma13 An
which introduces ambiguity, as the three tokens form re-
ally just one mention, and, in the case of the word “tm-
nEh”, which has the segmentation
tmnEha12 ta13 mnEa13 h
the first and third tokens should both be labeled as
pronominal mentions – but, to do this, they need to be
separated from the stem mnE.
Pragmatically, we found segmenting Arabic text to be a
necessary and beneficial process due mainly to two facts:
1. some prefixes/suffixes can receive a different men-
tion type than the stem they are glued to (for in-
stance, in the case of pronouns);
2. keeping words together results in significant data
sparseness, because of the inflected nature of the
language.
4Each language may have a different notion of what repre-
sents a token.
5In the current implementation, the models use a history of
2 tags.
Feature Type Ar Zh En
Token in window of 5 a0 a0 a0
Morph in window of 5 a0 N/A a0
POS info a0 a0 a0
Text chunking info — — a0
Capitalization/word-type N/A N/A a0
Prefixes/suffixes a0 N/A a0
Gazetteer info a0 a0 a0
Gap — — a0
Wordnet info — — a0
Segmentation a0 a0 N/A
Additional systems’ output a0 a0 a0
Table 1: Summary of features used by the 3 systems
Given these observations, we decided to “condition” the
output of the system on the segmented data: the text is
first segmented into tokens, and the classification is then
performed on tokens. The segmentation model is similar
to the one presented by Lee et al. (2003), and obtains an
accuracy of about 98%.
In addition, special attention is paid to prefixes and suf-
fixes: in order to reduce the number of spurious tokens
we re-merge the prefixes or suffixes to their correspond-
ing stem if they are not essential to the classification pro-
cess. For this purpose, we collect the following statistics
for each prefix/suffix a1a17a76 from the ACE training data: the
frequency of a1a35a76 occurring as a mention by itself (a1 ) and
the frequency of a1a17a76 occurring as a part of mention (a91 ).
If the ratio a2 a3 is below a threshold (estimated on the de-
velopment data), a1a35a76 is re-merged with its corresponding
stem. Only few prefixes and suffixes were merged using
these criteria. This is appropriate for the ACE task, since
a large percentage of prefixes and suffixes are annotated
as pronoun mentions6.
In addition to the language-general features described
in Section 2.3, the Arabic system implements a feature
specifying for each token its original stem.
For this system, the gazetteer features are computed on
words, not on tokens; the gazetteers consist of 12000 per-
son names and 3000 location and country names, all of
which have been collected by few man-hours web brows-
ing. The system also uses features based on the output
of three additional mention detection classifiers: a RRM
model predicting 48 mention categories, a RRM model
and a HMM model predicting 32 mention categories.
2.5 Chinese Mention Detection
In Chinese text, unlike in Indo-European languages,
words neither are white-space delimited nor do they have
capitalization markers. Instead of a word-based model,
we build a character-based one, since word segmentation
6For some additional data, annotated with 32 named cate-
gories, mentioned later on, we use the same approach of col-
lecting the a4 and a5 statistics, but, since named mentions are
predominant and there are no pronominal mentions in that case,
most suffixes and some prefixes are merged back to their origi-
nal stem.
errors can lead to irrecoverable mention detection errors;
Jing et al. (2003) also observe that character-based mod-
els are better performing than word-based ones for Chi-
nese named entity recognition. Although the model is
character-based, segmentation information is still useful
and is integrated as an additional feature stream.
Some more information about additional resources
used in building the system:
a3 Gazetteers include dictionaries of 10k person
names, 8k location and country names, and 3k orga-
nization names, compiled from annotated corpora.
a3 There are four additional classifiers whose output is
used as features: a RRM model which outputs 32
named categories, a RRM model identifying 49 cat-
egories, a RRM model identifying 45 mention cat-
egories, and a RRM model that classifies whether a
character is an English character, a numeral or other.
2.6 English Mention Detection
The English mention detection model is similar to the
system described in (Ittycheriah et al., 2003)7.The fol-
lowing is a list of additional features (again, a56a65a48 is the
current token):
a3 Shallow parsing information associated with the to-
kens in window of 3;
a3 Prefixes/suffixes of length up to 4;
a3 A capitalization/word-type flag (similar to the ones
described by Bikel et al. (1997));
a3 Gazetteer information: a handful of location (55k
entries) person names (30k) and organizations (5k)
dictionaries;
a3 A combination of gazetteer, POS and capitalization
information, obtained as follows: if the word is a
closed-class word — select its class, else if it’s in
a dictionary — select that class, otherwise back-off
to its capitalization information; we call this feature
gap;
a3 WordNet information (the synsets and hypernyms of
the two most frequent senses of the word);
a3 The outputs of three systems (HMM, RRM and
MaxEnt) trained on a 32-category named entity data,
the output of an RRM system trained on the MUC-6
data, and the output of RRM model identifying 49
categories.
3 Entity Tracking
This section introduces a novel statistical approach to en-
tity tracking. We choose to model the process of forming
entities from mentions, one step at a time. The process
works from left to right: it starts with an initial entity
consisting of the first mention of a document, and the next
mention is processed by either linking it with one of the
7The main difference between their system and ours is that
they build a MaxEnt model capable of building hierarchical
structures – therefore treating the problem as a parsing task –
while our system treats the problem as a classification task.
existing entities, or starting a new entity. The process
could have as output any one of the possible partitions of
the mention set.8 Two separate models are used to score
the linking and starting actions, respectively.
3.1 Tracking Algorithm
Formally, let a3a1a0 a48a3a2 a23a5a4 a68 a4 a45 a16 be a45 mentions in a
document. Let a6 a2 a68a8a7a9a11a10 be the map from mention index
a68 to entity index a10 . For a mention index a12 a31
a23a13a4
a12
a4
a45 a34 ,
let us define a14a16a15
a1 a3
a6 a31
a23
a34
a8a11a10a12a10a11a10a13a8
a6 a31a17a12a19a18
a23
a34
a16
the set of indices of the partially-established entities to
the left of a0
a15
(note that
a14
a7 a1a21a20 ), and
a22
a15
a1 a3a24a23a26a25
a2a28a27 a27
a14 a15
a16 a8
the set of the partially-established entities.
Given that a22
a15
has been formed to the left of the ac-
tive mention a0
a15
, a0
a15
can take two possible actions: if
a6 a31a29a12a22a34 a27
a14a16a15
, then the active mention a0
a15
is said to link with
the entity a23a1a30
a99
a15
a101
; Otherwise it starts a new entity a23a24a30
a99
a15
a101
. At
training time, the action is known to us, and at testing
time, both hypotheses will be kept during search. Notice
that a sequence of such actions corresponds uniquely to
an entity outcome (or a partition of mentions). There-
fore, the problem of coreference resolution is equivalent
to ranking the action sequences.
In this work, a binary modela91 a31a32a31 a1 a23 a92a22
a15
a8a33a0
a15
a8a33a34 a1
a27 a34
is used to compute the link probability, where a27 a27
a14a35a15
, a31
is a23 iff a0
a15
links with a23a1a25 ; the random variable a34 is the
index of the partial entity to which a0
a15
is linking. Since
starting a new entity means that a0
a15
does not link with
any entities in a22
a15
, the probability of starting a new entity,
a91
a31a32a31
a1a2a21
a92
a22
a15
a8a33a0
a15
a34 , can be computed as
a91
a31a32a31
a1 a21
a92
a22
a15
a8a36a0
a15
a34
a1
a62
a25a38a37a40a39a42a41
a91
a31a32a31
a1a2a21a22a8a43a34 a1
a27a11a92
a22
a15
a8a33a0
a15
a34
a1
a23
a18
a62
a25a38a37a40a39a42a41
a91
a31
a34a73a1
a27a11a92
a22
a15
a8a36a0
a15
a34
a91
a31a29a31
a1a20a23
a92
a22
a15
a8a36a0
a15
a8a33a34 a1
a27 a34
(1)
Therefore, the probability of starting an entity can
be computed using the linking probabilities a91 a31a32a31 a1
a23
a92
a22
a15
a8a33a0
a15
a8a43a34 a1
a27 a34 , provided that the marginal
a91
a31
a34 a1
a27a11a92
a22
a15
a8a36a0
a15
a34 is known. While other models are possible, in
the results reported in this paper, a91 a31 a34a81a1 a27a11a92a22
a15
a8a33a0
a15
a34 is
approximated as:
a91
a31
a34 a1
a27a11a92
a22
a15
a8a36a0
a15
a34
a1
a44
a45 a46
a23 if
a27
a1
a80a83a82a38a84 a85a86a80a9a87 a48
a37a40a39 a41
a91
a31a32a31
a1a20a23
a92
a22
a15
a8a36a0
a15
a8a43a34 a1
a68a34
a21 otherwise
(2)
8The number of all possible partitions of a set is given by
the Bell number (Bell, 1934). This number is very large even
for a document with a moderate number of mentions: about
a47a28a48a26a49 a50 trillion for a 20-mention document. For practical reasons,
the search space has to be reduced to a reasonably small set of
hypotheses.
That is, the starting probability is just one minus the max-
imum linking probability.
Training directly the model a91 a31a32a31 a1 a23 a92a22
a15
a8a36a0
a15
a8a43a34a20a1
a68a34
is difficult since it depends on all partial entities a22
a15
. As
a first attempt of modeling the process from mentions to
entities, we make the following modeling assumptions:
a91
a31a29a31
a1 a23
a92
a22
a15
a8a33a0
a15
a8a33a34 a1
a68a98a34a8a51
a91
a31a32a31
a1a20a23
a92
a23
a48
a8a36a0
a15
a34 (3)
a51 a85 a80a24a87
a25
a37a40a52a54a53
a91
a31a32a31
a1 a23
a92
a0 a8a33a0
a15
a34
a10 (4)
Once the linking probabilitya91 a31a29a31
a1 a23
a92
a22
a15
a8a36a0
a15
a8a33a34 a1
a68a34
is available, the starting probability a91 a31a32a31 a1 a21 a92a22
a15
a8a33a0
a15
a34
can be computed using (1) and (2). The strategy used to
find the best set of entities is shown in Algorithm 3.
Algorithm 3 Coreference Decoding Algorithm
Input: mentions in text a1 a1a20a3a26a0 a48 a2 a23a83a8a11a10a12a10a11a10a13a8a45 a16
Output: a partition a22 of the set a1
a55
a79
a3 a22a57a56 a1 a3a24a3a1a0 a7 a16a24a16a83a16a59a58
a76
a5a61a60
a31
a22a62a56
a34
a1 a23
foreach a12 a1a21a63a77a8a12a10a11a10a11a10a11a8a45
a55a13a64
a79
a20
foreach a22 a27 a55
a22 a64
a79
a22a5a65 a3a24a3a1a0
a15
a16a83a16
a76
a5a66a60
a31
a22a19a64
a34a69a79
a76
a5a61a60
a31
a22
a34a96a64
a91
a31a17a31
a1 a21
a92
a22 a8a33a0
a15
a34
a55 a64
a79
a55 a64 a65 a3 a22 a64 a16
foreacha68a69a27
a14a16a15
a22 a64
a79 a31
a22a5a67 a3a1a23
a48
a16
a34
a65 a3a1a23
a48
a65 a3a26a0
a15
a16a24a16
a76
a5a66a60
a31
a22 a64
a34a66a79
a76
a5a66a60
a31
a22
a34 a64
a91
a31a29a31
a1 a23
a92
a22 a8a36a0
a15
a8a33a34 a1
a68a34
a55 a64
a79
a55 a64 a65 a3 a22 a64 a16
a55
a79
a1
a60a24a68
a45
a23
a31
a55 a64
a34
return a80a24a82a14a84 a85 a80a24a87
a69
a37a28a70
a76
a5a66a60
a31
a22
a34
3.2 Entity Tracking Features
A maximum entropy model is used to implement (4).
Atomic features used by the model include:
a3 string match – whether or not the mention strings of
a0 and a0
a15
are exactly match, or partially match;
a3 context – surrounding words or part-of-speech tags
(if available) of mentions a0 a8a33a0
a15
;
a3 mention count – how many times a mention string
appears in the document. The count is quantized;
a3 distance – distance between the two mentions in
words and sentences. This number is also quantized;
a3 editing distance – quantized editing distance be-
tween the two mentions;
a3 mention information – spellings of the two mentions
and other information (such as POS tags) if avail-
able; If a mention is a pronoun, the feature also com-
putes gender, plurality, possessiveness and reflexive-
ness;
a3 acronym – whether or not one mention is the
acronym of the other mention;
a3 syntactic features – whether or not the two mentions
appear in apposition. This information is extracted
from a parse tree, and can be computed only when a
parser is available;
Data Set Arabic Chinese English
Train 65.6k 86.5k 340.7k
Development Test 7.7k 7.2k 71k
Sep’03 Eval Test 93.5k 108.2k 60.7k
Table 2: Data statistics (number of tokens) for Arabic,
Chinese and English
Another category of features is created by taking con-
junction of the atomic features. For example, the model
can capture how far a pronoun mention is from a named
mention when the distance feature is used in conjunction
with mention information feature.
As it is the case with with mention detection ap-
proach presented in Section 2, most features used here are
language-independent and are instantiated from the train-
ing data, while some are language-specific, but mostly
because the resources were not available for the specific
language. For example, syntactic features are not used in
the Arabic system due to the lack of an Arabic parser.
Simple as it seems, the mention-pair model has been
shown to work well (Soon et al., 2001; Ng and Cardie,
2002). As will be shown in Section 4, the relatively
knowledge-lean feature sets work fairly well in our tasks.
Although we also use a mention-pair model, our
tracking algorithm differs from Soon et al. (2001),
Ng and Cardie (2002) in several aspects. First, the
mention-pair model is used as an approximation to the
entity-mention model (3), which itself is an approxima-
tion of a91 a31a32a31 a1 a23 a92a22
a15
a8a36a0
a15
a8a43a34a81a1
a68a34 . Second, instead of
doing a pick-first (Soon et al., 2001) or best-first (Ng and
Cardie, 2002) selection, the mention-pair linking model
is used to compute a starting probability. The starting
probability enables us to score the action of creating a
new entity without thresholding the link probabilities.
Third, this probabilistic framework allows us to search
the space of all possible entities, while Soon et al. (2001),
Ng and Cardie (2002) take the “best” local hypothesis.
4 Experimental Results
The data used in all experiments presented in this sec-
tion is provided by the Linguistic Data Consortium and is
distributed by NIST to all participants in the ACE evalua-
tion. In the comparative experiments for the mention de-
tection and entity tracking tasks, the training data for the
English system consists of the training data from both the
2002 evaluation and the 2003 evaluation, while for Ara-
bic and Chinese, new additions to the ACE task in 2003,
consists of 80% of the provided training data. Table 2
shows the sizes of the training, development and eval-
uation test data for the 3 languages. The data is anno-
tated with five types of entities: person, organization,
geo-political entity, location, facility; each mention can
be either named, nominal or pronominal, and can be ei-
ther generic (not referring to a clearly described entity)
or specific.
The models for all three languages are built as joint
models, simultaneously predicting the type, level and
genericity of a mention – basically each mention is la-
beled with a 3-pronged tag. To transform the problem
into a classification task, we use the IOB2 classification
scheme (Tjong Kim Sang and Veenstra, 1999).
4.1 The ACE Value
A gauge of the performance of an EDT system is the ACE
value, a measure developed especially for this purpose. It
estimates the normalized weighted cost of detection of
specific-only entities in terms of misses, false alarms and
substitution errors (entities marked generic are excluded
from computation): any undetected entity is considered
a miss, system-output entities with no corresponding ref-
erence entities are considered false alarms, and entities
whose type was mis-assigned are substitution errors. The
ACE value computes a weighted cost by applying differ-
ent weights to each error, depending on the error type and
target entity type (e.g. PERSON-NAMEs are weighted
a lot more heavily than FACILITY-PRONOUNs) (NIST,
2003a). The cumulative cost is normalized by the cost
of a (hypothetical) system that outputs no entities at all
– which would receive an ACE value of a21 . Finally, the
normalized cost is subtracted from 100.0 to obtain the
ACE value; a value of 100% corresponds to perfect en-
tity detection. A system can obtain a negative score if it
proposed too many incorrect entities.
In addition, for the mention detection task, we will also
present results by using the more established F-measure,
computed as the harmonic mean of precision and recall
– this measure gives equal importance to all entities, re-
gardless of their type, level or genericity.
4.2 EDT Results
As described in Section 2.6, the mention detection sys-
tems make use of a large set of features. To better assert
the contribution of the different types of features to the fi-
nal performance, we have grouped them into 4 categories:
1. Surface features: lexical features that can be derived
from investigating the words: words, morphs, pre-
fix/suffix, capitalization/word-form flags
2. Features derived from processing the data with NLP
techniques: POS tags, text chunks, word segmenta-
tion, etc.
3. Gazetteer/dictionary features
4. Features obtained by running other named-entity
classifiers (with different tag sets): HMM, MaxEnt
and RRM output on the 32-category, 49-category
and MUC data sets.9
Table 3 presents the mention detection comparative re-
sults, F-measure and ACE value, on Arabic and Chinese
data. The Arabic and Chinese models were built using
9In the English MaxEnt system, which uses 295k features,
the distribution among the four classes of features is: 1:72%,
2:24%, 3:1%, 4:3%.
Feature Arabic Chinese
Sets F-measure ACE F-measure ACE
1 59.7 43.1 62.6 51.1
1+2 60.8 46.0 67.1 57.7
1+2+3 63.4 51.8 68.4 67.7
1+2+3+4 68.5 53.2 68.6 74.1
Table 3: Mention detection results for the Arabic and
Chinese
Arabic Chinese English
Feb02 Sept02
ACE value 83.2 89.4 90.9 88.0
Table 4: Entity tracking results on true mentions
the RRM model. There are some interesting observa-
tions: first, the F-measure performance does not correlate
well with an improvement in ACE value – small improve-
ments in F-measure sometimes are paired with large rela-
tive improvements in ACE value, fact due to the different
weighting of entity types. Second, the largest single im-
provement in ACE value is obtained by adding dictionary
features, at least in this order of adding features.
For English, we investigated in more detail the way
features interact. Figure 1 presents a hierarchical direct
comparison between the performance of the RRM model
and the MaxEnt model. We can observe that the RRM
model makes better use of gazetteers, and manages to
close the initial performance gap to the MaxEnt model.
Table 4 presents the results obtained by running the en-
tity tracking algorithm on true mentions. It is interesting
to compare the entity tracking results with inter-annotator
agreements. LDC reported (NIST, 2003b) that the inter-
annotator agreement (computed as ACE-values) between
annotators are a0a2a1 a10a4 %, a3a4a0 a10a1 % and a3a4a0 a10a3 % for Arabic, Chi-
nese and English, respectively. The system performance
is very close to human performance on this task; this
small difference in performance highlights the difficulty
of the entity tracking task.
Finally, Table 5 presents the results obtained by run-
ning both mention detection followed by entity tracking
on the ACE’03 evaluation data. Our submission in the
evaluation performed well relative to the other partici-
pating systems (contractual obligations prevent us from
elaborating further).
4.3 Discussion
The same basic model was used to perform EDT in three
languages. Our approach is language-independent, in that
Arabic Chinese English
RRM MaxEnt
ACE value 54.5 58.8 69.7 73.4
Table 5: ACE value results for the three languages on
ACE’03 evaluation data.
73.2
71.3
English
70.8 70.7
1+2
MaxEntRRM
73.4
69.1 70.4
1+41+3
72.6 72.5
1+2+3
1
1+2+4 1+3+4
72.172.1 72.0 73.2
72.571.871.4
1+2+3+4
Figure 1: Performance of the English mention detection
system on different sets of features (uniformly penalized
F-measure), September’02 data. The lower part of each
box describes the particular combination of feature types;
the arrows show a inclusion relationship between the fea-
ture sets.
the fundamental classification algorithm can be applied to
every language and the only changes involve finding ap-
propriate and available feature streams for each language.
The entity tracking system uses even fewer language-
specific features than the mention detection systems.
One limitation apparent in our mention detection sys-
tem is that it does not model explicitly the genericity of
a mention. Deciding whether a mention refers to a spe-
cific entity or a generic entity requires knowledge of sub-
stantially wider context than the window of 5 tokens we
currently use in our mention detection systems. One way
we plan to improve performance for such cases is to sep-
arate the task into two parts: one in which the mention
type and level are predicted, followed by a genericity-
predicting model which uses long-range features, such as
sentence or document level features.
Our entity tracking system currently cannot resolve the
coreference of pronouns very accurately. Although this is
weighted lightly in ACE evaluation, good anaphora res-
olution can be very useful in many applications and we
will continue exploring this task in the future.
The Arabic and Chinese EDT tasks were included in
the ACE evaluation for the first time in 2003. Unlike
the English case, the systems had access to only a small
amount of training data (60k words for Arabic and 90k
characters for Chinese, in contrast with 340k words for
English), which made it difficult to train statistical mod-
els with large number of feature types. Future ACE evalu-
ations will shed light on whether this lower performance,
shown in Table 3, is due to lack of training data or to
specific language-specific ambiguity.
The final observation we want to make is that the sys-
tems were not directly optimized for the ACE value, and
there is no obvious way to do so. As Table 3 shows, the
F-measure and ACE value do not correlate well: systems
trained to optimize the former might not end up optimiz-
ing the latter. It is an open research question whether a
system can be directly optimized for the ACE value.
5 Conclusion
This paper presents a language-independent framework
for the entity detection and tracking task, which is shown
to obtain top-tier performance on three radically differ-
ent languages: Arabic, Chinese and English. The task is
separated into two sub-tasks: a mention detection part,
which is modeled through a named entity-like approach,
and an entity tracking part, for a which a novel modeling
approach is proposed.
This statistical framework is general and can incor-
porate heterogeneous feature types — the models were
built using a wide array of lexical, syntactic and seman-
tic features extracted from texts, and further enhanced
by adding the output of pre-existing semantic classifiers
as feature streams; additional feature types help improve
the performance significantly, especially in terms of ACE
value. The experimental results show that the systems
perform remarkably well, for both well investigated lan-
guages, such as English, and for the relatively new addi-
tions Arabic and Chinese.
6 Acknowledgements
We would like to thank Dr. Tong Zhang for providing us
with the RRM toolkit.
This work was partially supported by the Defense
Advanced Research Projects Agency and monitored by
SPAWAR under contract No. N66001-99-2-8916. The
views and findings contained in this material are those
of the authors and do not necessarily reflect the position
of policy of the U.S. government and no official endorse-
ment should be inferred.
References
J. Aberdeen, D. Day, L. Hirschman, P. Robinson, and
M. Vilain. 1995. Mitre: Description of the Alembic
system used for MUC-6. In Proceedings of MUC-6,
pages 141–155.
E. T. Bell. 1934. Exponential numbers. American Math.
Monthly, 41:411–419.
A. Berger, S. Della Pietra, and V. Della Pietra. 1996. A
maximum entropy approach to natural language pro-
cessing. Computational Linguistics, 22(1):39–71.
D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel.
1997. Nymble: a high-performance learning name-
finder. In Proceedings of ANLP-97, pages 194–201.
A. Borthwick, J. Sterling, E. Agichtein, and R. Grish-
man. 1998. Exploiting diverse knowledge sources via
maximum entropy in named entity recognition.
A. Ittycheriah, L. Lita, N. Kambhatla, N. Nicolov,
S. Roukos, and M. Stys. 2003. Identifying and track-
ing entity mentions in a maximum entropy framework.
In HLT-NAACL 2003: Short Papers, May 27 - June 1.
H. Jing, R. Florian, X. Luo, T. Zhang, and A. Itty-
cheriah. 2003. HowtogetaChineseName(Entity): Seg-
mentation and combination issues. In Proceedings of
EMNLP’03, pages 200–207.
Y.-S. Lee, K. Papineni, S. Roukos, O. Emam, and
H. Hassan. 2003. Language model based Arabic word
segmentation. In Proceedings of the ACL’03, pages
399–406.
A. Mikheev, M. Moens, and C. Grover. 1999. Named
entity recognition without gazetteers. In Proceedings
of EACL’99.
S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwarz,
R. Stone, and R. Weischedel. 1998. Bbn: Description
of the SIFT system as used for MUC-7. In MUC-7.
G. A. Miller. 1995. WordNet: A lexical database. Com-
munications of the ACM, 38(11).
V. Ng and C. Cardie. 2002. Improving machine learning
approaches to coreference resolution. In Proceedings
of the ACL’02, pages 104–111.
NIST. 2003a. The ACE evaluation plan.
www.nist.gov/speech/tests/ace/index.htm.
NIST. 2003b. Proceedings of ACE’03. Booklet, Alexan-
dria, VA, September.
L. Ramshaw and M. Marcus. 1994. Exploring the sta-
tistical derivation of transformational rule sequences
for part-of-speech tagging. In Proceedings of the ACL
Workshop on Combining Symbolic and Statistical Ap-
proaches to Language, pages 128–135.
L. Ramshaw and M. Marcus. 1995. Text chunking us-
ing transformation-based learning. In Proceedings of
WVLC’95, pages 82–94.
W. M. Soon, H. T. Ng, and C. Y. Lim. 2001. A machine
learning approach to coreference resolution of noun
phrases. Computational Linguistics, 27(4):521–544.
E. F. Tjong Kim Sang and J. Veenstra. 1999. Represent-
ing text chunks. In Proceedings of EACL’99.
E. F. Tjong Kim Sang. 2002. Introduction to the CoNLL-
2002 shared task: Language-independent named en-
tity recognition. In Proceedings of CoNLL-2002,
pages 155–158.
R. J. Williams and J. Peng. 1990. An efficient
gradient–based algorithm for on–line training of re-
current neural networks trajectories. Neural Compu-
tation, 2(4):490–501.
T. Zhang, F. Damerau, and D. E. Johnson. 2002. Text
chunking based on a generalization of Winnow. Jour-
nal of Machine Learning Research, 2:615–637.
