Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 97–104, Vancouver, October 2005. ©2005 Association for Computational Linguistics
A Large-Scale Exploration of Effective Global Features
for a Joint Entity Detection and Tracking Model
Hal Daumé III and Daniel Marcu
Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
{hdaume,marcu}@isi.edu
Abstract
Entity detection and tracking (EDT) is
the task of identifying textual mentions
of real-world entities in documents, ex-
tending the named entity detection and
coreference resolution task by consider-
ing mentions other than names (pronouns,
definite descriptions, etc.). Like NE tag-
ging and coreference resolution, most so-
lutions to the EDT task separate out the
mention detection aspect from the corefer-
ence aspect. By doing so, these solutions
are limited to using only local features for
learning. In contrast, by modeling both
aspects of the EDT task simultaneously,
we are able to learn using highly com-
plex, non-local features. We develop a
new joint EDT model and explore the util-
ity of many features, demonstrating their
effectiveness on this task.
1 Introduction
In many natural language applications, such as au-
tomatic document summarization, machine transla-
tion, question answering and information retrieval,
it is advantageous to pre-process text documents to
identify references to entities. An entity, loosely
de ned, is a person, location, organization or geo-
political entity (GPE) that exists in the real world.
Being able to identify references to real-world enti-
ties of these types is an important and difficult natu-
ral language processing problem. It involves finding
text spans that correspond to an entity, identifying
what type of entity it is (person, location, etc.), iden-
tifying what type of mention it is (name, nominal,
pronoun, etc.) and  nally identifying which other
mentions in the document it corefers with. The dif-
ficulty lies in the fact that there are often many am-
biguous ways to refer to the same entity. For exam-
ple, consider the two sentences below:
[Bill Clinton]_{NAM,PER,1} gave a speech today to
[the Senate]_{NAM,ORG,2}. [The President]_{NOM,PER,1} outlined
[his]_{PRO,PER,1} plan for budget reform to [them]_{PRO,ORG,2}.
There are five entity mentions in these two sen-
tences, each of which is bracketed (the correspond-
ing mention type, entity type and coreference chain
index appear as subscripts), but only two entities:
{Bill Clinton, The President, his} and {the
Senate, them}. The mention detection task is to
identify the entity mentions and their types, without
regard for the underlying entity sets, while corefer-
ence resolution groups the given mentions into sets.
Current state-of-the-art solutions to this problem
split it into two parts: mention detection and coref-
erence (Soon et al., 2001; Ng and Cardie, 2002; Flo-
rian et al., 2004). First, a model is run that attempts
to identify each mention in a text and assign it a type
(person, organization, etc.). Then, one holds these
mentions fixed and attempts to identify which ones
refer to the same entity. This is typically accom-
plished through some form of clustering, with clus-
tering weights often tuned through some local learn-
ing procedure. This pipelining scheme has the sig-
nificant drawback that the mention detection module
cannot take advantage of information from the coref-
erence module. Moreover, within the coreference
task, performing learning and clustering as separate
tasks makes learning rather ad-hoc.
In this paper, we build a model that solves the
mention detection and coreference problems in a
simultaneous, joint manner. By doing so, we are
able to obtain an empirically superior system as well
as integrate a large collection of features that one
cannot consider in the standard pipelined approach.
Our ability to perform this modeling is based on the
Learning as Search Optimization framework, which
we review in Section 2. In Section 3, we describe
our joint EDT model in terms of the search proce-
dure executed. In Section 4, we describe the features
we employ in this model; these include the stan-
dard lexical, semantic (WordNet) and string match-
ing features found in most other systems. We ad-
ditionally consider many other feature types, most
interestingly count-based features, which take into
account the distribution of entities and mentions
(and are not expressible in the binary classification
method for coreference) and knowledge-based fea-
tures, which exploit large corpora for learning name-
to-nominal references. In Section 5, we present our
experimental results. First, we compare our joint
system with a pipelined version of the system, and
show that joint inference leads to improved perfor-
mance. Next, we perform an extensive feature com-
parison experiment to determine which features are
most useful for the coreference task, showing that
our newly introduced features provide useful new in-
formation. We conclude in Section 6.
2 Learning as Search Optimization
When one attempts to apply current, standard ma-
chine learning algorithms to problems with combi-
natorial structured outputs, the resulting algorithm
implicitly assumes that it is possible to find the
best structures for a given input (and some model
parameters). Furthermore, most models require
much more, either in the form of feature expecta-
tions for conditional likelihood-based methods (Laf-
ferty et al., 2001) or local marginal distributions
for margin-based methods (Taskar et al., 2003). In
many cases, including EDT and coreference, this
is a false assumption. Often, we are not able to find
the best solution, but rather must employ an approx-
imate search to find the best possible solution, given
time and space constraints. The Learning as Search
Algo Learn(problem, initial, enqueue, w, x, y)
  nodes ← MakeQueue(MakeNode(problem, initial))
  while nodes is not empty do
    node ← RemoveFront(nodes)
    if none of nodes ∪ {node} is y-good or
       GoalTest(node) and node is not y-good then
      sibs ← siblings(node, y)
      w ← update(w, x, sibs, node ∪ nodes)
      nodes ← MakeQueue(sibs)
    else
      if GoalTest(node) then return w
      next ← Operators(node)
      nodes ← enqueue(problem, nodes, next, w)
    end if
  end while
Figure 1: The generic search/learning algorithm.
Optimization (LaSO) framework exploits this diffi-
culty as an opportunity and seeks to find model pa-
rameters that are good within the context of search.
More formally, following the LaSO framework,
we assume that there is a set of input structures $\mathcal{X}$
and a set of output structures $\mathcal{Y}$ (in our case, ele-
ments $x \in \mathcal{X}$ will be documents and elements $y \in \mathcal{Y}$
will be documents marked up with mentions and
their coreference sets). Additionally, we provide the
structure of a search space $\mathcal{S}$ that results in elements
of $\mathcal{Y}$ (we will discuss our choice for this component
later in Section 3). The LaSO framework relies on
a monotonicity assumption: given a structure $y \in \mathcal{Y}$
and a node $n$ in the search space, we must be able
to calculate whether it is possible for this node $n$ to
eventually lead to $y$ (such nodes are called $y$-good).
LaSO parameterizes the search process with a
weight vector $w \in \mathbb{R}^D$, where weights correspond
to features of search space nodes and inputs. Specif-
ically, we write $\Phi : \mathcal{X} \times \mathcal{S} \to \mathbb{R}^D$ as a function that
takes a pair of an input $x$ and a node in the search
space $n$ and produces a vector of features. LaSO
takes a standard search algorithm and modifies it to
incorporate learning in an online manner, yielding the algo-
rithm shown in Figure 1. The key idea is to perform
search as normal until a point at which it becomes
impossible to reach the correct solution. When this
happens, the weight vector $w$ is updated in a correc-
tive fashion. The algorithm relies on a parameter up-
date formula; the two suggested by (Daumé III and
Marcu, 2005) are a standard Perceptron-style update
and an approximate large margin update of the sort
proposed by (Gentile, 2001). In this work, we only
use the large margin update, since in the original
LaSO work, it consistently outperformed the sim-
pler Perceptron updates. The update has the form
given below:
$w \leftarrow \mathrm{proj}\left( w + C k^{-1/2}\, \mathrm{proj}\left[ \frac{\sum_{n \in \mathrm{sibs}} \Phi(x,n)}{|\mathrm{sibs}|} - \frac{\sum_{n \in \mathrm{nodes}} \Phi(x,n)}{|\mathrm{nodes}|} \right] \right)$
where $k$ is the update number, $C$ is a tunable param-
eter and $\mathrm{proj}$ projects a vector into the unit sphere.
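To make the update concrete, here is a minimal Python sketch (ours, not the authors' implementation; `phi(x, n)` is an assumed function returning a NumPy feature vector for input x and search node n):

```python
import numpy as np

def project(v):
    """Project a vector into the unit sphere: scale it down if its norm exceeds 1."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 1.0 else v

def large_margin_update(w, phi, x, sibs, nodes, k, C=1.0):
    """One corrective LaSO update: move w toward the (averaged) features of
    the y-good siblings and away from those of the nodes left in the queue."""
    good = sum(phi(x, n) for n in sibs) / len(sibs)
    bad = sum(phi(x, n) for n in nodes) / len(nodes)
    return project(w + C * k ** -0.5 * project(good - bad))
```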
3 Joint EDT Model
The LaSO framework essentially requires us to spec-
ify two components: the search space (and corre-
sponding operations) and the features. These two are
inherently tied, since the features rely on the search
space, but for the time being we will ignore the issue
of the feature functions and focus on the search.
3.1 Search Space
We structure search in a left-to-right decoding
framework: a hypothesis is a complete identifica-
tion of the initial segment of a document. For in-
stance, on a document with $N$ words, a hypothesis
that ends at position $0 \le n \le N$ is essentially what
you would get if you took the full structured output
and chopped it off at word $n$. In the example given in
the introduction, one hypothesis might correspond to
"[Bill Clinton] gave a" (which would be a $y$-good hy-
pothesis), or to "[Bill] Clinton gave a" (which would
not be a $y$-good hypothesis).
A hypothesis is expanded through the application
of the search operations. In our case, the search pro-
cedure first chooses the number of words it is going
to consume (for instance, to form the mention "Bill
Clinton," it would need to consume two words).
Then, it decides on an entity type and a mention type
(or it opts to call this chunk not an entity (NAE), cor-
responding to non-underlined words). Finally, as-
suming it did not choose to form an NAE, it decides
on which of the foregoing coreference chains this
entity belongs to, or none (if it is the first mention of
a new entity). All these decisions are made simulta-
neously, and the given hypothesis is then scored.
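As a rough illustration of this branching structure, consider the following sketch (ours; the hypothesis fields and type inventories are illustrative, and the bookkeeping that copies chains into successors is omitted):

```python
from itertools import product

ENTITY_TYPES = ["PER", "ORG", "GPE", "LOC", "FAC"]   # illustrative ACE-style types
MENTION_TYPES = ["NAM", "NOM", "PRO"]

def expand(hyp, words, max_len=5):
    """Enumerate successor hypotheses of a partial analysis `hyp`.
    `hyp` is a dict with 'pos' (next word index) and 'chains' (coreference
    chains built so far); every decision for the new chunk is made at once."""
    for length in range(1, min(max_len, len(words) - hyp["pos"]) + 1):
        span = (hyp["pos"], hyp["pos"] + length)
        yield {**hyp, "pos": span[1], "last": (span, None, None)}       # NAE chunk
        for etype, mtype in product(ENTITY_TYPES, MENTION_TYPES):
            yield {**hyp, "pos": span[1], "last": (span, (etype, mtype), "new")}
            for cid in range(len(hyp["chains"])):                       # corefer
                yield {**hyp, "pos": span[1], "last": (span, (etype, mtype), cid)}
```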
3.2 An Example
For concreteness, consider again the text given in
the introduction. Suppose that we are at the word
"them" and the hypothesis we are expanding is cor-
rect. That is, we have correctly identified "Bill Clin-
ton" with entity type "person" and mention type
"name"; that we have identified "the Senate" with
entity type "organization" and mention type "name";
and that we have identified both "The President" and
"his" as entities with entity type "person" and men-
tion types "nominal" and "pronoun," respectively;
and that "The President" points back to the chain
{Bill Clinton} and that "his" points back to the chain
{Bill Clinton, The President}.
At this point of search, we have two choices for
length: one or two (because there are only two words
left: "them" and a period). A first hypothesis would
be that the word "them" is NAE. A second hypothe-
sis would be that "them" is a named person and is a
new entity; a third hypothesis would be that "them"
is a named person and is coreferent with the "Bill
Clinton" chain; a fourth hypothesis would be that
"them" is a pronominal organization and is a new
entity; next, "them" could be a pronominal organiza-
tion that is coreferent with "the Senate"; and so on.
Similar choices would be considered for the string
"them ." when two words are selected.
3.3 Linkage Type
One significant issue that arises in the context of as-
signing a hypothesis to a coreference chain is how to
compute features over that chain. As we will discuss
in Section 4, the majority of our coreference-specific
features are over pairs of chunks: the proposed new
mention and an antecedent. However, since in gen-
eral a proposed mention can have well more than one
antecedent, we are left with a decision about how to
combine this information.
The first, most obvious solution, is to essentially
do nothing: simply compute the features over all
pairs and add them up as usual. This method, how-
ever, intuitively has the potential for over-counting
the effects of large chains. To compensate for this,
one might advocate the use of an average link com-
putation, where the score for a coreference chain is
computed by averaging over its elements. One might
also consider a max link or min link scenario, where
one of the extrema is chosen as the value. Other re-
search has suggested that a simple last link, where a
mention is simply matched against the most recent
mention in a chain might be appropriate, while first
link might also be appropriate because the first men-
tion of an entity tends to carry the most information.
In addition to these standard linkages, we also
consider an intelligent link scenario, where the
method of computing the link structure depends on
the mention type. The intelligent link is computed
as follows, based on the mention type of the current
mention, $m$:
If $m =$ NAM then: match first on the NAM elements
in the chain; if there are none, match against the
last NOM element; otherwise, use max link.
If $m =$ NOM then: match against the max NOM in
the chain; otherwise, match against the last
NAM; otherwise, use max link.
If $m =$ PRO then: use average link across all PRO
or NAM; if there are none, use max link.
The construction of this methodology was guided
by intuition (for instance, matching names against
names is easy, and the first name tends to be the most
complete) and subsequently tuned by experimenta-
tion on the development data. One might consider
learning the best link method, and this may result in
better performance, but we do not explore this op-
tion in this work. The initial results we present will
be based on using intelligent link, but we will also
compare the different linkage types explicitly.
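One way to render these rules in code is sketched below (a sketch under our reading of the rules above, not the authors' code; mentions are assumed to be dicts with an "mtype" field and `score_pair` is an assumed pairwise scoring function):

```python
def intelligent_link(mention, chain, score_pair):
    """Choose which antecedent(s) in `chain` (ordered oldest first) to score
    `mention` against, depending on the mention's type (NAM/NOM/PRO)."""
    def of_type(*types):
        return [m for m in chain if m["mtype"] in types]

    if mention["mtype"] == "NAM":
        names, noms = of_type("NAM"), of_type("NOM")
        if names:
            return score_pair(mention, names[0])                # first name in chain
        if noms:
            return score_pair(mention, noms[-1])                # last nominal
    elif mention["mtype"] == "NOM":
        noms, names = of_type("NOM"), of_type("NAM")
        if noms:
            return max(score_pair(mention, m) for m in noms)    # max over nominals
        if names:
            return score_pair(mention, names[-1])               # last name
    elif mention["mtype"] == "PRO":
        cands = of_type("PRO", "NAM")
        if cands:                                               # average link
            return sum(score_pair(mention, m) for m in cands) / len(cands)
    return max(score_pair(mention, m) for m in chain)           # fallback: max link
```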
4 Feature Functions
All the features we consider are of the form base-
feature × decision-feature, where base features are
functions of the input and decision features are func-
tions of the hypothesis. For instance, a base feature
might be something like "the current chunk contains
the word 'Clinton'" and a decision feature might be
something like "the current chunk is a named per-
son."
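Concretely, the conjunction can be pictured as a cross-product over feature dictionaries (an illustrative sketch, not the actual implementation):

```python
def meta_features(base_feats, decision_feats):
    """Cross every base feature (a function of the input) with every decision
    feature (a function of the hypothesis); each conjunction gets its own weight."""
    return {f"{b} & {d}": vb * vd
            for b, vb in base_feats.items()
            for d, vd in decision_feats.items()}

# Example:
# meta_features({"chunk-contains=Clinton": 1.0}, {"decision=named-person": 1.0})
# -> {"chunk-contains=Clinton & decision=named-person": 1.0}
```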
4.1 Base Features
For pedagogical purposes and to facilitate model com-
parisons, we have separated the base features into
eleven classes: lexical, syntactic, pattern-based,
count-based, semantic, knowledge-based, class-
based, list-based, inference-based, string match fea-
tures and history-based features. We will deal with
each of these in turn. Finally, we will discuss how
these base features are combined into meta-features
that are actually used for prediction.
Lexical features. The class of lexical features
contains simply computable features of single
words. This includes: the number of words in the
current chunk; the unigrams (words) contained in
this chunk; the bigrams; the two-character prefixes
and suffixes; the word stem; the case of the word,
computed by regular expressions like those given by
(Bikel et al., 1999); simple morphological features
(number, person and tense when applicable); and, in
the case of coreference, pairs of features between the
current mention and an antecedent.
Syntactic features. The syntactic features are
based on running an in-house state-of-the-art part-
of-speech tagger and syntactic chunker on the data.
The features include unigrams and bigrams of part of
speech as well as unigram chunk features. We have
not used any parsing for this task.
Pattern-based features. We have included a
whole slew of features based on lexical and part of
speech patterns surrounding the current word. These
include: eight hand-written patterns for identifying
pleonastic "it" and "that" (as in "It is raining" or
"It seems to be the case that . . . "); identification
of pluralization features on the previous and next
head nouns (this is intended to help make decisions
about entity types); the previous and next content
verb (also intended to help with entity type identi-
fication); the possessor or possessee in the case of
simple possessive constructions ("The president's
speech" would yield a feature of "president" on the
word "speech", and vice-versa; this is intended to
be a sort of weak sub-categorization principle); a
similar feature but applied to the previous and next
content verbs (again to provide a weak sort of sub-
categorization); and, for coreference, a list of part of
speech and word sequence patterns that match up to
four words between nearby mentions that are either
highly indicative of coreference (e.g., "of," "said,"
"am," "a") or highly indicative of non-coreference
(e.g., "'s," "and," "in the," "and the"). This last set
was generated by looking at intervening strings and
finding the top twenty that had maximal mutual in-
formation with the class (coreferent or not corefer-
ent) across the training data.
Count-based features. The count-based features
apply only to the coreference task and attempt to
capture regularities in the size and distribution of
coreference chains. These include: the total num-
ber of entities detected thus far; the total number
of mentions; the entity to mention ratio; the entity
to word ratio; the mention to word ratio; the size
of the hypothesized entity chain; the ratio of the
number of mentions in the current entity chain to
the total number of mentions; the number of inter-
vening mentions between the current mention and
the last one in our chain; the number of intervening
mentions of the same type; the number of interven-
ing sentence breaks; the Hobbs distance computed
over syntactic chunks; and the  decayed density 
of the hypothesized entity, which is computed as
$\sum_{m \in E} 0.5^{d(m)} / \sum_{m} 0.5^{d(m)}$, where $m$ ranges over
all previous mentions (constrained in the numerator
to be in the same coreference chain $E$ as our mention)
and $d(m)$ is the number of entities away this men-
tion is. This feature captures the fact that some entities
are referred to consistently across a document, while
others are mentioned only for short segments, but it
is relatively rare for an entity to be mentioned once
at the beginning and then ignored until the end.
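A sketch of this computation (ours, not the authors' code; we approximate $d(m)$ by the number of mentions back, where the paper counts entities away):

```python
def decayed_density(prev_mentions, chain_id):
    """"Decayed density" of a hypothesized entity: sum of 0.5**d(m) over
    previous mentions m of that entity, normalized by the same sum over
    all previous mentions; d(m) is how many mentions back m occurred."""
    num = sum(0.5 ** d for d, m in enumerate(reversed(prev_mentions))
              if m["chain"] == chain_id)
    den = sum(0.5 ** d for d in range(len(prev_mentions)))
    return num / den if den else 0.0

# e.g. decayed_density([{"chain": 1}, {"chain": 2}, {"chain": 1}], 1)
# -> (0.5**0 + 0.5**2) / (0.5**0 + 0.5**1 + 0.5**2) = 1.25 / 1.75 ≈ 0.714
```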
Semantic features. The semantic features used
are drawn from WordNet (Fellbaum, 1998). They
include: the two most common synsets from Word-
Net for all the words in a chunk; all hypernyms of
those synsets; for coreference, we also consider the
distance in the WordNet graph between pairs of head
words (defined to be the final word in the mention
name) and whether one is a part of the other. Finally,
we include the synset and hypernym information of
the preceding and following verbs, again to model a
sort of sub-categorization principle.
Knowledge-based features. Based on the hypoth-
esis that many name to nominal coreference chains
are best understood in terms of background knowl-
edge (for instance, that "George W. Bush" is the
"President"), we have attempted to take advantage
of recent techniques from large scale data mining
to extract lists of such pairs. In particular, we use
the name/instance lists described by (Fleischman et
al., 2003) and available on Fleischman’s web page to
generate features between names and nominals (this
list contains 2M pairs mined from 15GBs of news
data). Since this data set tends to focus mostly on
person instances from news, we have additionally
used similar data mined from a 138GB web corpus,
for which more general "ISA" relations were mined
(Ravichandran et al., 2005).
Class-based features. The class-based features
we employ are designed to get around the sparsity
of data problem while simultaneously providing new
information about word usage. The first class-based
feature we use is based on word classes derived from
the web corpus mentioned earlier and computed as
described by (Ravichandran et al., 2005). The sec-
ond attempts to instill knowledge of collocations in
the data; we use the technique described by (Dun-
ning, 1993) to compute multi-word expressions and
then mark words that are commonly used as such
with a feature that expresses this fact.
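The collocation statistic referred to here is Dunning's likelihood-ratio test; a standard formulation is sketched below (our code, not necessarily the authors' exact computation):

```python
from math import log

def llr(c12, c1, c2, n):
    """Dunning's (1993) log-likelihood ratio for a candidate collocation:
    c12 = bigram count, c1/c2 = counts of each word, n = total bigram count.
    Assumes 0 < c12 < c1 and c12 < c2 < n so that no log(0) occurs."""
    def ll(k, t, p):  # log-likelihood of k successes in t Bernoulli(p) trials
        return k * log(p) + (t - k) * log(1 - p)
    p = c2 / n                      # null hypothesis: same rate everywhere
    p1 = c12 / c1                   # rate of word2 right after word1
    p2 = (c2 - c12) / (n - c1)      # rate of word2 elsewhere
    return 2 * (ll(c12, c1, p1) + ll(c2 - c12, n - c1, p2)
                - ll(c12, c1, p) - ll(c2 - c12, n - c1, p))
```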
List-based features. We have gathered a collec-
tion of about 40 lists of common places, organiza-
tions, names, etc. These include the standard lists
of names gathered from census data and baby name
books, as well as standard gazetteer information list-
ing countries, cities, islands, ports, provinces and
states. We supplement these standard lists with
lists of airport locations (gathered from the FAA)
and company names (mined from the NASDAQ and
NYSE web pages). We additionally include lists of
semantically plural but syntactically singular words
(e.g., "group") which were mined from a large cor-
pus by looking for patterns such as "members of the
. . . ". Finally, we use a list of persons, organizations
and locations that were identified at least 100 times
in a large corpus by the BBN IdentiFinder named
entity tagger (Bikel et al., 1999).
These lists are used in three ways. First, we use
simple list membership as a feature to improve de-
tection performance. Second, for coreference, we
look for word pairs that appear on the same list but
are not identical (for instance, "Russia" and "Eng-
land" appearing on the "country" list but not being
identical hints that they are different entities). Fi-
nally, we look for pairs where one element in the pair
is the head word from one mention and the other el-
ement in the pair is a list. This is intended to capture
the notion that a word that appears on our "country"
list is often coreferent with the word "country."
Inference-based features. The inference-based
features are computed by attempting to infer an un-
derlying semantic property of a given mention. In
particular, we attempt to identify gender and seman-
tic number (e.g., "group" is semantically plural al-
though it is syntactically singular). To do so, we cre-
ated a corpus of example mentions labeled with num-
ber and gender, respectively. This data set was auto-
matically extracted from our EDT data set by look-
ing for words that corefer with pronouns for which
we know the number or gender. For instance, a men-
tion that corefers with "she" is known to be singu-
lar and female, while a mention that corefers with
"they" is known to be plural. In about 5% of the
cases, this was ambiguous; these cases were thrown
out. We then used essentially the same features as
described above to build a maximum entropy model
for predicting number and gender. The predictions
of this model are used both as features for detec-
tion as well as coreference (in the latter case, we
check for matches). Additionally, we use several
pre-existing classifiers as features. These are simple
maximum entropy Markov models trained off of the
MUC6 data, the MUC7 data and our ACE data.
String match features. We use the standard string
match features that are described in every other
coreference paper. These are: string match; sub-
string match; string overlap; pronoun match; and
normalized edit distance. In addition, we also use
a string nationality match, which matches, for in-
stance, "Israel" and "Israeli," "Russia" and "Rus-
sian," "England" and "English," but not "Nether-
lands" and "Dutch." This is done by checking
for common suffixes on nationalities and match-
ing the first half of the words based on ex-
act match. We additionally use a linguistically-
motivated string edit distance, where the replace-
ment costs are lower for vowels and other easily con-
fusable characters. We also use the Jaro distance as
an additional string distance metric. Finally, we at-
tempt to match acronyms by looking at initial letters
from the words in long chunks.
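A possible rendering of the nationality match heuristic (ours; the suffix list is illustrative, not taken from the paper):

```python
def nationality_match(a, b):
    """Heuristic nationality match, e.g. 'Israel' ~ 'Israeli', 'Russia' ~
    'Russian'. Strips a common demonym suffix from one word and checks that
    the first halves of the words match exactly ('Netherlands'/'Dutch' fails)."""
    suffixes = ("ian", "ish", "ese", "i", "an", "n")   # illustrative list
    for x, y in ((a, b), (b, a)):
        for suf in suffixes:
            if y.lower().endswith(suf):
                stem = y[: len(y) - len(suf)]
                half = max(1, len(x) // 2)
                if x.lower()[:half] == stem.lower()[:half]:
                    return True
    return False
```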
History-based features. Finally, for the detection
phase of the task, we include features having to
do with long-range dependencies between words.
For instance, if at the beginning of the document
we tagged the word "Arafat" as a person's name
(perhaps because it followed "Mr." or "Palestinian
leader"), and later in the document we again see the
word "Arafat," we should be more likely to call this
a person’s name, again. Such features have previ-
ously been explored in the context of information
extraction from meeting announcements using con-
ditional random fields augmented with long-range
links (Sutton and McCallum, 2004), but the LaSO
framework makes no Markov assumption, so there
is no extra effort required to include such features.
4.2 Decision Features
Our decision features are divided into three classes:
simple, coreference and boundary features.
Simple. The simple decision features include: is
this chunk tagged as an entity; what is its entity type;
what is its entity subtype; what is its mention type;
what is its entity type/mention type pair.
Coreference. The coreference decision features
include: is this entity the start of a chain or con-
tinuing an existing chain; what is the entity type of
this started (or continued) chain; what is the entity
subtype of this started (or continued) chain; what is
the mention type of this started chain; what is the
mention type of this continued chain and the men-
tion type of the most recent antecedent.
Boundary. The boundary decision features in-
clude: the second and third order Markov features
over entity type, entity subtype and mention type;
features appearing at the previous (and next) words
within a window of three; the words that appear at
the previous and next mention boundaries, specified
also by entity type, entity subtype and mention type.
5 Experimental Results
5.1 Data
We use the official 2004 ACE training and test set
for evaluation purposes; however, we exclude from
the training set the Fisher conversations data, since
this is very different from the other data sets and
there is no Fisher data in the 2004 test set. This
amounts to 392 training documents, consisting of
8.1k sentences and 160k words. There are a total
of 24k mentions in the data corresponding to 10k
entities (note that the data is not annotated for cross-
document coreference, so instances of "Bill Clinton"
appearing in two different documents are counted as
two different entities). Roughly half of the entities
are people, a fifth are organizations, a fifth are GPEs
and the remaining are mostly locations or facilities.
The test data is 192 documents, 3.5k sentences and
64k words, with 10k mentions to 4.5k entities. In
all cases, we use a beam of 16 for training and test,
and ignore features that occur fewer than five times
in the training data.
5.2 Evaluation Metrics
There are many evaluation metrics possible for this
data. We will use as our primary measure of quality
the ACE metric. This is computed, roughly, by first
matching system mentions with reference mentions,
then using those to match system entities with ref-
erence entities. There are costs, once this matching
is complete, for type errors, false alarms and misses,
which are combined together to give an ACE score,
ranging from 0 to 100, with 100 being perfect (we
use v.10 of the ACE evaluation script).
5.3 Joint versus Pipelined
We compare the performance of the joint system
with the pipelined system. For the pipelined sys-
tem, to build the mention detection module, we use
the same technique as for the full system, but sim-
ply don’t include in the hypotheses the coreference
chain information (essentially treating each mention
as if it were in its own chain). For the stand-alone
coreference system, we assume that the correct men-
tions and types are always given, and simply hypoth-
esize the chain (though still in a left-to-right man-
ner).1 Run as such, the joint model achieves an
ACE score of 79.4 and the pipelined model achieves
an ACE score of 78.1, a reasonably substantial im-
provement for performing both tasks simultaneously.
We have also computed the performance of these
two systems, ignoring the coreference scores (this
is done by considering each mention to be its own
entity and recomputing the ACE score). In this
case, the joint model, ignoring its coreference out-
put, achieves an ACE score of a114a49a96a51a94a97a117 and the pipelined
model achieves a score of a114a49a96a51a94a97a113 . The joint model
1One subtle difficulty with the joint model has to do with
the online nature of the learning algorithm: at the beginning of
training, the model is guessing randomly at what words are enti-
ties and what words are not entities. Because of the large num-
ber of initial errors made in this part of the task, the weights
learned by the coreference model are initially very noisy. We
experimented with two methods for compensating for this ef-
fect. The first was to give the mention identification model a
"head start": it was run for one full pass through the training
data, ignoring the coreference aspect, and the following itera-
tions were then trained jointly. The second method was to only
update the coreference weights when the mention was identified
correctly. On development data, the second was more efficient
and outperformed the first, so we use it for the experiments
reported in this section.
Figure 2: Comparison of performance as different
feature classes are removed.
does marginally better, but it is unlikely to be sta-
tistically significant. In the 2004 ACE evaluation,
the best three performing systems achieved scores
of 79.9, 79.7 and 78.2; it is unlikely that our system
is significantly worse than these.
5.4 Feature Comparison for Coreference
In this section, we analyze the effects of the differ-
ent base feature types on coreference performance.
We use a model with perfect mentions, entity types
and mention types (with the exception of pronouns:
we do not assume we know pronoun types, since
this gives away too much information), and measure
the performance of the coreference system. When
run with the full feature set, the model achieves an
ACE score of 89.1 and when run with no added fea-
tures beyond simple biases, it achieves 65.4. The
best performing system in the 2004 ACE competi-
tion achieved a score of 91.5 on this task; the next
best system scored 88.2, which puts us squarely in
the middle of these two (though, likely not statis-
tically signi cantly different). Moreover, the best
performing system took advantage of additional data
that they labeled in house.
To compute feature performance, we begin with
all feature types and iteratively remove them one-
by-one so that we get the best performance (we do
not include the "history" features, since these are
not relevant to the coreference task). The results are
shown in Figure 2. Across the top line, we list the
ten feature classes. The first row of results shows
the performance of the system after removing just
one feature class. In this case, removing lexical fea-
tures reduces performance to 88.9, while removing
string-match features reduces performance to 83.6.
The non-shaded box (in this case, syntactic features)
shows the feature set that can be removed with the
least penalty in performance. The second row re-
peats this, after removing syntactic features.
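The ablation procedure itself can be sketched as a greedy backward elimination (our sketch; `evaluate` stands for retraining and scoring the model on a feature subset, which is expensive):

```python
def greedy_ablation(feature_classes, evaluate):
    """Repeatedly drop the feature class whose removal costs the least ACE
    score, re-evaluating the model at every step (cf. Figure 2)."""
    remaining = set(feature_classes)
    history = []
    while len(remaining) > 1:
        scores = {f: evaluate(remaining - {f}) for f in remaining}
        drop = max(scores, key=scores.get)      # cheapest class to remove
        remaining.discard(drop)
        history.append((drop, scores[drop]))
    return history
```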
As we can see from this figure, we can freely re-
move syntax, semantics and classes with little de-
crease in performance. From that point, patterns are
dropped, followed by lists and inference, each with
a performance drop of about 0.4 or 0.5. Removing
the knowledge-based features results in a large drop
from 87.6 down to 85.6 and removing count-based
features drops the performance another 0.7 points.
Based on this, we can easily conclude that the most
important feature classes to the coreference problem
are, in order, string matching features, lexical fea-
tures, count features and knowledge-based features,
the latter two of which are novel to this work.
5.5 Linkage Types
As stated in the previous section, the coreference-
only task with intelligent link achieves an ACE score
of 89.1. The next best score is with min link (88.7),
followed by average link with a score of 88.1. There
is then a rather large drop with max link to 86.2,
followed by another drop for last link to 83.5, and
first link performs the poorest, scoring 81.5.
6 Discussion
In this paper, we have applied the Learning as
Search Optimization (LaSO) framework to the entity
detection and tracking task. The framework is an ex-
cellent choice for this problem, due to the fact that
many relevant features for the coreference task (and
even for the mention detection task) are highly non-
local. This non-locality makes models like Markov
networks intractable, and LaSO provides an excel-
lent framework for tackling this problem. We have
introduced a large set of new, useful features for this
task, most specifically the use of knowledge-based
features for helping with the name-to-nominal prob-
lem, which has led to a substantial improvement in
performance. We have shown that performing joint
learning for mention detection and coreference re-
sults in a better performing model than pipelined
learning. We have also provided a comparison of the
contributions of our various feature classes and com-
pared different linkage types for coreference chains.
In the process, we have developed an efficient model
that is competitive with the best ACE systems.
Despite these successes, our model is not perfect:
the largest source of error is with pronouns. This
is masked by the fact that the ACE metric weights
pronouns low, but a solution to the EDT problem
should handle pronouns well. We intend to explore
more complex features for resolving pronouns, and
to incorporate these features into our current model.
We also intend to explore more complex models for
automatically extracting knowledge from data that
can help with this task and applying this technique
to a real application, such as summarization.
Acknowledgments: We thank three anonymous review-
ers for helpful comments. This work was supported by DARPA-
ITO grant NN66001-00-1-9814 and NSF grant IIS-0326276.
References
D. Bikel, R. Schwartz, and R. Weischedel. 1999. An algorithm
that learns what’s in a name. Machine Learning, 34.
H. Daumé III and D. Marcu. 2005. Learning as search opti-
mization: Approximate large margin methods for structured
prediction. In ICML.
T. Dunning. 1993. Accurate methods for the statistics of sur-
prise and coincidence. Computational Linguistics, 19(1).
C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical
Database. The MIT Press, Cambridge, MA.
M. Fleischman, E. Hovy, and A. Echihabi. 2003. Offline strate-
gies for online question answering: Answering questions be-
fore they are asked. In ACL.
R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla,
X. Luo, N. Nicolov, and S. Roukos. 2004. A statisti-
cal model for multilingual entity detection and tracking. In
NAACL/HLT.
C. Gentile. 2001. A new approximate maximal margin classifi-
cation algorithm. JMLR, 2:213–242.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional
random fields: Probabilistic models for segmenting and la-
beling sequence data. In ICML.
V. Ng and C. Cardie. 2002. Improving machine learning ap-
proaches to coreference resolution. In ACL.
D. Ravichandran, P. Pantel, and E. Hovy. 2005. Randomized
algorithms and NLP: Using locality sensitive hash functions
for high speed noun clustering. In ACL.
W. Soon, H. Ng, and D. Lim. 2001. A machine learning ap-
proach to coreference resolution of noun phrases. Computa-
tional Linguistics, 27(4):521–544.
C. Sutton and A. McCallum. 2004. Collective segmentation
and labeling of distant entities in information extraction. In
ICML workshop on Statistical Relational Learning.
B. Taskar, C. Guestrin, and D. Koller. 2003. Max-margin
Markov networks. In NIPS.