A Mention-Synchronous Coreference Resolution Algorithm Based on the
Bell Tree
Xiaoqiang Luo and Abe Ittycheriah
Hongyan Jing and Nanda Kambhatla and Salim Roukos
IBM T. J. Watson Research Center
1101 Kitchawan Road
Yorktown Heights, NY 10598, U.S.A.
{xiaoluo,abei,hjing,nanda,roukos}@us.ibm.com
Abstract
This paper proposes a new approach for
coreference resolution which uses the Bell
tree to represent the search space and casts
the coreference resolution problem as finding
the best path from the root of the Bell tree to
the leaf nodes. A Maximum Entropy model
is used to rank these paths. We report coreference
performance on the 2002 and 2003 Automatic
Content Extraction (ACE) data. We also train a
coreference system on the MUC6 data and obtain
competitive results.
1 Introduction
In this paper, we will adopt the terminologies used in
the Automatic Content Extraction (ACE) task (NIST,
2003). Coreference resolution in this context is defined
as partitioning mentions into entities. A mention is an
instance of reference to an object, and the collection
of mentions referring to the same object in a document
form an entity. For example, in the following sentence,
mentions are underlined:
“The American Medical Association voted
yesterday to install the heir apparent as its
president-elect, rejecting a strong, upstart
challenge by a District doctor who argued
that the nation’s largest physicians’ group
needs stronger ethics and new leadership.”
“American Medical Association”, “its” and “group”
belong to the same entity as they refer to the same ob-
ject.
Early work on anaphora resolution focused on finding
antecedents of pronouns (Hobbs, 1976; Ge et al.,
1998; Mitkov, 1998), while recent advances (Soon et
al., 2001; Yang et al., 2003; Ng and Cardie, 2002; Itty-
cheriah et al., 2003) employ statistical machine learning
methods and try to resolve reference among all kinds
of noun phrases (NPs), be they name, nominal, or
pronominal phrases – which is the scope of this paper
as well. One common strategy shared by (Soon et al.,
2001; Ng and Cardie, 2002; Ittycheriah et al., 2003) is
that a statistical model is trained to measure how likely
a pair of mentions is to corefer; a greedy procedure is
then followed to group mentions into entities. While this
approach has yielded encouraging results, the way men-
tions are linked is arguably suboptimal in that an instant
decision is made when considering whether two men-
tions are linked or not.
In this paper, we propose to use the Bell tree to rep-
resent the process of forming entities from mentions.
The Bell tree represents the search space of the coref-
erence resolution problem – each leaf node corresponds
to a possible coreference outcome. We choose to model
the process from mentions to entities represented in the
Bell tree, and the problem of coreference resolution is
cast as finding the “best” path from the root node to
leaves. A binary maximum entropy model is trained to
compute the linking probability between a partial entity
and a mention.
The rest of the paper is organized as follows. In
Section 2, we present how the Bell tree can be used
to represent the process of creating entities from men-
tions and the search space. We use a maximum en-
tropy model to rank paths in the Bell tree, which is dis-
cussed in Section 3. After presenting the search strat-
egy in Section 4, we show the experimental results on
the ACE 2002 and 2003 data, and the Message Under-
standing Conference (MUC) (MUC, 1995) data in Sec-
tion 5. We compare our approach with some recent
work in Section 6.
2 Bell Tree: From Mention to Entity
Let us consider traversing mentions in a document from
beginning (left) to end (right). The process of form-
ing entities from mentions can be represented by a tree
structure. The root node is the initial state of the pro-
cess, which consists of a partial entity containing the
first mention of a document. The second mention is
[Figure 1 diagram: root node (a) = [1] with mention 2 active; second-layer nodes (b1) = [12] and (b2) = [1][2] with mention 3 active; leaf nodes (c1) = [123], (c2) = [12][3], (c3) = [13][2], (c4) = [1][23], (c5) = [1][2][3].]
Figure 1: Bell tree representation for three mentions:
numbers in [] denote a partial entity. In-focus entities
are marked on the solid arrows, and active mentions
are marked by *. Solid arrows signify that a mention
is linked with an in-focus partial entity while dashed
arrows indicate starting of a new entity.
added in the next step by either linking to the existing
entity, or starting a new entity. A second layer of nodes
is created to represent the two possible outcomes.
Subsequent mentions are added to the tree in the same
manner. The process is mention-synchronous in that
each layer of tree nodes is created by adding one
mention at a time. Since the number of tree leaves
is the number of possible coreference outcomes and it
equals the Bell Number (Bell, 1934), the tree is called
the Bell tree. The Bell Number $B(n)$ is the number
of ways of partitioning $n$ distinguishable objects
(i.e., mentions) into non-empty disjoint subsets (i.e.,
entities). The Bell Number has a "closed" formula

$B(n) = e^{-1} \sum_{k=0}^{\infty} \frac{k^n}{k!}$

and it increases rapidly as $n$ increases:
$B(20) \approx 5.2 \times 10^{13}$! Clearly, an efficient
search strategy is necessary, and it will be addressed in
Section 4.
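For concreteness, the growth of this search space can be checked with a short script (an illustrative sketch, not part of the system described here), using the standard recurrence $B(n+1) = \sum_k \binom{n}{k} B(k)$:

```python
# Illustrative sketch: Bell numbers via the recurrence
# B(n+1) = sum_k C(n, k) * B(k), showing how fast the
# Bell-tree search space grows with the number of mentions.
from math import comb

def bell(n):
    """B(n): number of ways to partition n mentions into entities."""
    b = [1]  # B(0) = 1
    for m in range(n):
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]

print(bell(3))   # 5  -- the five leaves of Figure 1
print(bell(20))  # 51724158235372, i.e. about 5.2e13
```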
Figure 1 illustrates how the Bell tree is created for
a document with three mentions. The initial node con-
sists of the first partial entity [1] (i.e., node (a) in Fig-
ure 1). Next, mention 2 becomes active (marked by "*"
in node (a)) and can either link with the partial entity
[1] and result in a new node (b1), or start a new entity
and create another node (b2). The partial entity which
the active mention considers linking with is said to be
in-focus. In-focus entities are highlighted on the solid
arrows in Figure 1. Similarly, mention 3 will be ac-
tive in the next stage and can take five possible actions,
which create five possible coreference results shown in
node (c1) through (c5).
Under the derivation illustrated in Figure 1, each leaf
node in the Bell tree corresponds to a possible corefer-
ence outcome, and there is no other way to form enti-
ties. The Bell tree clearly represents the search space
of the coreference resolution problem. The corefer-
ence resolution can therefore be cast equivalently as
finding the “best” leaf node. Since the search space is
large (even for a document with a moderate number of
mentions), it is difficult to estimate a distribution over
leaves directly. Instead, we choose to model the pro-
cess from mentions to entities, or in other words, score
paths from the root to leaves in the Bell tree.
A nice property of the Bell tree representation is that
the number of linking or starting steps is the same for
all the hypotheses. This makes it easy to rank them us-
ing the “local” linking and starting probabilities as the
number of factors is the same. The Bell tree represen-
tation is also incremental in that mentions are added
sequentially. This makes it easy to design a decoder
and search algorithm.
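To make the incremental derivation concrete, the following sketch (illustrative only, not the authors' implementation) enumerates the Bell-tree leaves by the same link-or-start process; for three mentions it produces exactly the five outcomes of Figure 1:

```python
# Illustrative sketch: enumerate Bell-tree leaves by adding one
# mention at a time, either linking it to an existing partial
# entity or starting a new entity.
def bell_leaves(mentions):
    hyps = [[[mentions[0]]]]  # root node: first partial entity
    for m in mentions[1:]:
        layer = []
        for ents in hyps:
            for j in range(len(ents)):  # link m with entity j
                layer.append([e + [m] if i == j else list(e)
                              for i, e in enumerate(ents)])
            layer.append([list(e) for e in ents] + [[m]])  # start [m]
        hyps = layer
    return hyps

print(bell_leaves([1, 2, 3]))  # five outcomes, matching (c1)-(c5)
```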
3 Coreference Model
3.1 Linking and Starting Model
We use a binary conditional model to compute the
probability that an active mention links with an in-focus
partial entity. The conditions include all the partially-
formed entities to the left, the index of the in-focus
entity, and the active mention.
Formally, let $\{m_i : 1 \le i \le n\}$ be the $n$ mentions
in a document, where mention index $i$ represents the
order in which the mention appears. Let $e_j$ be an entity,
and let $g : i \mapsto j$ be the (many-to-one) map from
mention index $i$ to entity index $j$. For an active mention
index $t$ ($1 \le t \le n$), define

$E_t = \{ j : j = g(i) \text{ for some } 1 \le i \le t-1 \},$

the set of indices of the partially-established entities to
the left of $m_t$ (note that $E_1 = \emptyset$), and

$\mathcal{E}_t = \{ e_j : j \in E_t \},$

the set of the partially-established entities themselves. The link
model is then

$P(L = 1 \mid \mathcal{E}_t, m_t, A_t = j),$ (1)

the probability of linking the active mention $m_t$ with the
in-focus entity $e_j$. The random variable $A_t$ takes its value
from the set $E_t$ and signifies which entity is in focus; $L$
takes a binary value and is $1$ if $m_t$ links with $e_j$.
As an example, for the branch from (b2) to (c4) in
Figure 1, the active mention is “3”, the set of partial
entities to the left of "3" is $\mathcal{E}_3 = \{[1], [2]\}$, and the
active entity is the second partial entity "[2]". The probability
$P(L = 1 \mid \mathcal{E}_3, m_3, A_3 = 2)$ measures how likely it is that
mention "3" links with the entity "[2]."
The model $P(L \mid \mathcal{E}_t, m_t, A_t = j)$ only computes
how likely $m_t$ links with $e_j$; it says nothing
about the possibility that $m_t$ starts a new entity. Fortu-
nately, the starting probability can be computed using the
link probabilities (1), as shown below.
Since starting a new entity means that $m_t$ does not
link with any entity in $\mathcal{E}_t$, the probability of starting
a new entity, $P(L = 0 \mid \mathcal{E}_t, m_t)$, can be computed as

$P(L = 0 \mid \mathcal{E}_t, m_t)$ (2)

$= \sum_{j \in E_t} P(L = 0, A_t = j \mid \mathcal{E}_t, m_t)$

$= 1 - \sum_{j \in E_t} P(A_t = j \mid \mathcal{E}_t, m_t) \cdot P(L = 1 \mid \mathcal{E}_t, m_t, A_t = j).$ (3)
(3) indicates that the probability of starting an en-
tity can be computed from the linking probabilities
$P(L = 1 \mid \mathcal{E}_t, m_t, A_t = j)$, provided that the marginal
$P(A_t = j \mid \mathcal{E}_t, m_t)$ is known. In this paper,
$P(A_t = j \mid \mathcal{E}_t, m_t)$ is approximated as:

$P(A_t = j \mid \mathcal{E}_t, m_t) = \begin{cases} 1 & \text{if } j = \arg\max_{i \in E_t} P(L = 1 \mid \mathcal{E}_t, m_t, A_t = i) \\ 0 & \text{otherwise.} \end{cases}$ (4)
With the approximation (4), the starting probability (3)
is

$P(L = 0 \mid \mathcal{E}_t, m_t) = 1 - \max_{j \in E_t} P(L = 1 \mid \mathcal{E}_t, m_t, A_t = j).$ (5)
The linking model (1) and approximated starting
model (5) can be used to score paths in the Bell tree.
For example, the score for the path (a)-(b2)-(c4) in Fig-
ure 1 is the product of the start probability from (a) to
(b2) and the linking probability from (b2) to (c4).
Since (5) is an approximation, not a true probability, a
constant $\alpha$ is introduced to balance the linking proba-
bility and the starting probability, and the starting probabil-
ity becomes:

$P'(L = 0 \mid \mathcal{E}_t, m_t) = \alpha \, P(L = 0 \mid \mathcal{E}_t, m_t).$ (6)

If $\alpha < 1$, it penalizes creating new entities; therefore,
$\alpha$ is called the start penalty. The start penalty $\alpha$ can be
used to balance entity misses and false alarms.
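The following sketch shows how Eqs. (4)-(6) combine in practice; the function name and the dictionary representation are illustrative assumptions, not part of the original system:

```python
# Illustrative sketch of Eqs. (4)-(6): derive the (penalized)
# starting probability from the link probabilities.
def start_probability(link_probs, alpha=1.0):
    """link_probs: dict mapping entity index j -> P(L=1 | E_t, m_t, A_t=j);
    alpha: the start penalty of Eq. (6)."""
    if not link_probs:  # E_1 is empty: the first mention must start
        return 1.0
    p_start = 1.0 - max(link_probs.values())  # Eqs. (4)-(5)
    return alpha * p_start                    # Eq. (6)

# Two partial entities with link probabilities 0.6 and 0.1:
# with alpha = 0.8, the start probability is 0.8 * (1 - 0.6) = 0.32.
print(start_probability({1: 0.6, 2: 0.1}, alpha=0.8))
```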
3.2 Model Training and Features
The model $P(L \mid \mathcal{E}_t, m_t, A_t = j)$ depends on all the par-
tial entities $\mathcal{E}_t$, which can be very expensive. After
making some modeling assumptions, we can approxi-
mate it as:

$P(L = 1 \mid \mathcal{E}_t, m_t, A_t = j)$ (7)

$\approx P(L = 1 \mid e_j, m_t)$ (8)

$\approx \max_{m' \in e_j} P(L = 1 \mid m', m_t).$ (9)
From (7) to (8), entities other than the one in focus,
$e_j$, are assumed to have no influence on the decision
of linking $m_t$ with $e_j$. (9) further assumes that the
entity-mention score can be obtained from the maximum
mention-pair score. The model (9) is very similar to
the models in (Morton, 2000; Soon et al., 2001; Ng and
Cardie, 2002), while (8) conditions on more information.
We use a maximum entropy model (Berger et al.,
1996) for both the mention-pair model (9) and the
entity-mention model (8):

$P(L \mid m', m_t) = \frac{\exp\big(\sum_i \lambda_i g_i(m', m_t, L)\big)}{Z(m', m_t)},$ (10)

$P(L \mid e_j, m_t) = \frac{\exp\big(\sum_i \lambda_i g_i(e_j, m_t, L)\big)}{Z(e_j, m_t)},$ (11)

where $g_i(\cdot, \cdot, L)$ is a feature and $\lambda_i$ is its weight;
$Z(\cdot, \cdot)$ is a normalizing factor that ensures (10) or (11)
is a probability. Effective training algorithms exist (Berger
et al., 1996) once the set of features $\{g_i(\cdot, \cdot, L)\}$ is se-
lected.
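As an illustration of (9) and (10), a binary maximum entropy scorer can be sketched as follows; the feature representation, and the simplifying assumption that features fire only with the $L = 1$ outcome (which reduces the model to a logistic form), are ours, not the paper's:

```python
# Illustrative sketch of Eqs. (9)-(10): a binary maxent link scorer.
# Assumption (ours): features fire only with the L=1 outcome, so the
# model reduces to a logistic form with Z = 1 + exp(score).
import math

def p_link(features, weights):
    """features: active feature names g_i(m', m_t, L=1);
    weights: feature name -> lambda_i. Returns P(L=1 | m', m_t)."""
    score = sum(weights.get(f, 0.0) for f in features)
    return math.exp(score) / (1.0 + math.exp(score))

def p_link_entity(entity, mention, extract, weights):
    """Entity-mention score as the mention-pair maximum of Eq. (9);
    extract(m, m') returns the active feature names for a pair."""
    return max(p_link(extract(m, mention), weights) for m in entity)
```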
The basic features used in the models are tabulated
in Table 1.
Features in the lexical category are applicable to
non-pronominal mentions only. Distance features char-
acterize how far the two mentions are, either by the
number of tokens, by the number of sentences, or by
the number of mentions in-between. Syntactic fea-
tures are derived from parse trees output from a maxi-
mum entropy parser (Ratnaparkhi, 1997). The “Count”
feature calculates how many times a mention string is
seen. For pronominal mentions, attributes such as gen-
der, number, possessiveness and reflexiveness are also
used. Apart from the basic features in Table 1, composite
features can be generated by taking conjunctions of ba-
sic features. For example, a distance feature together
with the reflexiveness of a pronoun mention can help
capture the fact that the antecedent of a reflexive pronoun is
often closer than that of a non-reflexive pronoun.
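A sketch of such conjunction generation (with hypothetical feature names drawn from Table 1) might look as follows:

```python
# Illustrative sketch: composite features as conjunctions of the
# basic features of Table 1 (feature values here are hypothetical).
def conjoin(basic):
    """basic: dict of basic feature name -> value for a mention pair.
    Returns basic features plus all pairwise conjunctions."""
    feats = {f"{k}={v}" for k, v in basic.items()}
    names = sorted(basic)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            feats.add(f"{a}={basic[a]}&{b}={basic[b]}")
    return feats

# conjoin({"sent_dist": 0, "reflexive": 1}) includes the conjunction
# "reflexive=1&sent_dist=0", capturing nearby reflexive antecedents.
```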
The same set of basic features in Table 1 is used
in the entity-mention model, but feature definitions
are slightly different. Lexical features, including the
acronym features and the apposition feature, are com-
puted by testing any mention in the entity $e_j$ against the
active mention $m_t$. The editing distance for $(e_j, m_t)$ is
defined as the minimum distance between any non-pronoun
mention in the entity and the active mention. Distance
features are computed by taking the minimum over mentions
in the entity and the active mention.
In the ACE data, mentions are annotated at three
levels: NAME, NOMINAL and PRONOUN. For each
ACE entity, a canonical mention is defined as the
longest NAME mention, if available; otherwise, the most
recent NOMINAL mention; and if there is neither a NAME
nor a NOMINAL mention, the most recent pronominal
mention. In the entity-mention model, the "ncd", "spell"
and "count" features are computed over the canonical
mention of the in-focus entity and the active mention.
Conjunction features are used in the entity-mention
model too.
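The canonical-mention rules above translate directly into code; the data representation here is an assumption for illustration:

```python
# Illustrative sketch of the canonical-mention rules; the tuple
# representation (level, text) in document order is an assumption.
def canonical_mention(entity):
    """entity: list of (level, text) pairs in document order, where
    level is "NAME", "NOMINAL" or "PRONOUN"."""
    names = [m for m in entity if m[0] == "NAME"]
    if names:
        return max(names, key=lambda m: len(m[1]))  # longest NAME
    nominals = [m for m in entity if m[0] == "NOMINAL"]
    if nominals:
        return nominals[-1]  # most recent NOMINAL
    return entity[-1]        # most recent pronoun
```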
Category Features Remark
Lexical exact_strm 1 if two mentions have the same spelling; 0 otherwise
left_subsm 1 if one mention is a left substring of the other; 0 otherwise
right_subsm 1 if one mention is a right substring of the other; 0 otherwise
acronym 1 if one mention is an acronym of the other; 0 otherwise
edit_dist quantized editing distance between two mention strings
spell pair of actual mention strings
ncd number of different capitalized words in two mentions
Distance token_dist how many tokens two mentions are apart (quantized)
sent_dist how many sentences two mentions are apart (quantized)
gap_dist how many mentions in between the two mentions in question (quantized)
Syntax POS_pair POS-pair of two mention heads
apposition 1 if two mentions are appositive; 0 otherwise
Count count pair of (quantized) numbers, each counting how many times a mention string is seen
Pronoun gender pair of attributes of {female, male, neutral, unknown }
number pair of attributes of {singular, plural, unknown}
possessive 1 if a pronoun is possessive; 0 otherwise
reflexive 1 if a pronoun is reflexive; 0 otherwise
Table 1: Basic features used in the maximum entropy model.
The mention-pair model is appealing for its simplicity:
features are easy to compute over a pair of mentions. Its
drawback is that information outside the mention pair is
ignored. Suppose a document has three mentions
"Mr. Clinton", "Clinton" and "she", appearing in that
order. When considering the mention pair "Clinton"
and "she", the model may tend to link them because of
their proximity; but this mistake can easily be avoided
if "Mr. Clinton" and "Clinton" have been put into the
same entity and the model knows that "Mr. Clinton"
refers to a male while "she" is female. Since gender and
number information is propagated at the entity level, the
entity-mention model is able to check gender consistency
when considering the active mention "she".
3.3 Discussion
There is an in-focus entity in the condition of the link-
ing model (1), while the starting model (2) conditions
on all the entities to the left. The disparity is intentional,
as the starting action is influenced by all established
entities on the left.
(4) is not the only way $P(A_t = j \mid \mathcal{E}_t, m_t)$ can be
approximated. For example, one could use a uniform
distribution over $E_t$. We experimented with several
approximation schemes, including the uniform distribution,
and (4) worked the best and is adopted here. One may con-
sider training $P(A_t = j \mid \mathcal{E}_t, m_t)$ directly and using it to
score paths in the Bell tree. The problem is that 1) the
size of $E_t$, from which $A_t$ takes its value, is variable; and
2) the start action depends on all the entities in $\mathcal{E}_t$, which
makes it difficult to train $P(A_t = j \mid \mathcal{E}_t, m_t)$ directly.
4 Search Issues
As shown in Section 2, the search space of the coref-
erence problem can be represented by the Bell tree.
Thus, the search problem reduces to creating the Bell
tree while keeping track of path scores and picking the
top-N best paths. This is exactly what is described in
Algorithm 1.
In Algorithm 1, $H$ contains all the hypotheses, or
paths from the root to the current layer of nodes. The vari-
able $S(E)$ stores the cumulative score for a corefer-
ence result $E$. At line 1, $H$ is initialized with a single
entity consisting of mention $m_1$, which corresponds to
the root node of the Bell tree in Figure 1. Lines 2 to 15
loop over the remaining mentions ($m_2$ to $m_n$), and for
each mention $m_t$, the algorithm extends each result $E$
in $H$ (or a path in the Bell tree) by either linking $m_t$
with an existing entity $e_i$ (lines 5 to 10), or starting an
entity $[m_t]$ (lines 11 to 14). The loop from line 3 to 15
corresponds to creating a new layer of nodes for the ac-
tive mention $m_t$ in the Bell tree. $L_m$ in line 4 and $r$ in
lines 6 and 11 have to do with pruning, which will be
discussed shortly. The last line returns the top $N$ results,
where $E^{(r)}$ denotes the result ranked $r$-th by $S(\cdot)$:

$S(E^{(1)}) \ge S(E^{(2)}) \ge \cdots \ge S(E^{(N)}).$
Algorithm 1 Search Algorithm
Input: mentions $M = \{m_i : i = 1, \dots, n\}$; $N$
Output: top $N$ entity results
1: Initialize: $H := \{E_1 := \{[m_1]\}\}$; $S(E_1) = 1$
2: for $t = 2$ to $n$
3:   foreach node $E \in H$
4:     compute $L_m$
5:     foreach $i \in E_t$
6:       if ($P(L=1 \mid E, m_t, A = i) \ge r \cdot L_m$) {
8:         Extend $E$ to $E'_i$ by linking $m_t$ with $e_i$
9:         $S(E'_i) := S(E) \cdot P(L=1 \mid E, m_t, A = i)$
10:      }
11:      if ($P'(L=0 \mid E, m_t) \ge r \cdot L_m$) {
12:        Extend $E$ to $E'$ by starting $[m_t]$.
13:        $S(E') := S(E) \cdot P'(L=0 \mid E, m_t)$
14:      }
15:   $H := \{E'\} \cup \{E'_i : i \in E_t\}$
16: return $\{E^{(1)}, E^{(2)}, \dots, E^{(N)}\}$
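For illustration, a compact runnable rendering of Algorithm 1 might look as follows; the parameter names (alpha, r, beam) and the toy link model are our assumptions, with pruning as discussed below:

```python
# Illustrative, runnable rendering of Algorithm 1 with local pruning
# (factor r) and a beam limit; p_link stands in for the trained
# entity-mention scorer of Eq. (9).
def bell_tree_search(mentions, p_link, alpha=0.8, r=0.001, beam=20, topn=1):
    hyps = [([[mentions[0]]], 1.0)]  # line 1: H = {[m1]}, S = 1
    for m in mentions[1:]:           # line 2: remaining mentions
        layer = []
        for entities, score in hyps:
            link_ps = [p_link(e, m) for e in entities]
            start_p = alpha * (1.0 - max(link_ps))  # Eqs. (5)-(6)
            l_max = max(link_ps + [start_p])        # line 4
            for j, p in enumerate(link_ps):         # lines 5-10
                if p >= r * l_max:
                    ext = [list(e) for e in entities]
                    ext[j].append(m)
                    layer.append((ext, score * p))
            if start_p >= r * l_max:                # lines 11-14
                layer.append(([list(e) for e in entities] + [[m]],
                              score * start_p))
        # beam limit: keep only the highest-scoring live paths
        hyps = sorted(layer, key=lambda h: -h[1])[:beam]
    return hyps[:topn]

def toy_link(entity, m):  # hypothetical stand-in for the maxent model
    return 0.9 if m in entity else 0.1

print(bell_tree_search(["Clinton", "he", "Clinton"], toy_link))
```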
The complexity of the search Algorithm 1 is the total
number of nodes in the Bell tree, which is $\sum_{t=1}^{n} B(t)$,
where $B(t)$ is the Bell Number. Since the Bell Number
increases rapidly as a function of the number of men-
tions, pruning is necessary. We prune the search space
in the following places:
• Local pruning: any children with a score below a
fixed factor $r$ of the maximum score are pruned.
This is done at lines 6 and 11 in Algorithm 1. The
operation in line 4 is:

$L_m := \max\big(\{P'(L = 0 \mid E, m_t)\} \cup \{P(L = 1 \mid E, m_t, A = i) : i \in E_t\}\big).$
Block 8-9 is carried out only if $P(L = 1 \mid E, m_t, A = i) \ge r \cdot L_m$,
and block 12-13 is carried out only if $P'(L = 0 \mid E, m_t) \ge r \cdot L_m$.
• Global pruning: similar to local pruning, except
that it is done using the cumulative score $S(E)$.
Pruning based on the global scores is carried out
at line 15 of Algorithm 1.
• Limit hypotheses: we set a limit on the maxi-
mum number of live paths. This is useful when a
document contains many mentions, in which case
an excessive number of paths may survive local and
global pruning.
• Whenever available, we check the compatibility
of entity types between the in-focus entity and the
active mention. A hypothesis with incompatible
entity types is discarded. In the ACE annotation,
every mention has an entity type; therefore, we
can eliminate hypotheses that would place two
mentions of different types into the same entity.
5 Experiments
5.1 Performance Metrics
The official performance metric for the ACE task is
the ACE-value. The ACE-value is computed by first calculat-
ing the weighted cost of entity insertions, deletions and
substitutions; the cost is then normalized against the
cost of a nominal coreference system which outputs
no entities; the ACE-value is obtained by subtracting
the normalized cost from 1. Weights are designed to
emphasize NAME entities, while PRONOUN entities
(i.e., entities consisting of only pronominal mentions)
carry very low weights. A perfect coreference system
gets a 100% ACE-value, while a system that outputs no
entities gets a 0% ACE-value. Thus, the ACE-value
can be interpreted as the percentage of value a system has
relative to a perfect system.
Since the ACE-value is an entity-level metric and is
weighted heavily toward NAME entities, we also mea-
sure our system’s performance by an entity-constrained
mention F-measure (henceforth “ECM-F”). The metric
first aligns the system entities with the reference enti-
ties so that the number of common mentions is maxi-
mized. Each system entity is constrained to align with
at most one reference entity, and vice versa. For exam-
ple, suppose that a reference document contains three
entities: $\{[m_1], [m_2, m_3], [m_4]\}$, while a system out-
puts four entities: $\{[m_1, m_2], [m_3], [m_5], [m_6]\}$. Then
the best alignment (from reference to system) would be
$[m_1] \Leftrightarrow [m_1, m_2]$ and $[m_2, m_3] \Leftrightarrow [m_3]$, with the other
entities unaligned. The number of common mentions
under the best alignment is 2 (i.e., $m_1$ and $m_3$), which
leads to a mention recall of $2/4$ and a precision of $2/5$.
The ECM-F measures the percentage of mentions that
are in the "right" entities.
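The ECM-F alignment is a maximum one-to-one bipartite matching, which can be computed with the Hungarian algorithm; the following sketch (not the official scorer) reproduces the worked example above:

```python
# Illustrative ECM-F sketch: align system and reference entities
# one-to-one to maximize common mentions, via SciPy's Hungarian
# algorithm, then compute the mention F-measure.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ecm_f(reference, system):
    """reference, system: lists of entities, each a set of mention ids."""
    overlap = np.array([[len(r & s) for s in system] for r in reference])
    rows, cols = linear_sum_assignment(-overlap)  # maximize overlap
    common = overlap[rows, cols].sum()
    recall = common / sum(len(r) for r in reference)
    precision = common / sum(len(s) for s in system)
    return 2 * precision * recall / (precision + recall)

# The worked example above: recall 2/4, precision 2/5, ECM-F = 4/9.
print(ecm_f([{1}, {2, 3}, {4}], [{1, 2}, {3}, {5}, {6}]))  # 0.444...
```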
For tests on the MUC data, we report both F-measure
using the official MUC score (Vilain et al., 1995) and
ECM-F. The MUC score counts the common links be-
tween the reference and the system output.
5.2 Results on the ACE data
The system is first developed and tested using the ACE
data. The ACE coreference system is trained with 416
documents (about 190K words) of ACE 2002 training
data. A separate set of 90 documents (50K words) is used as
the development-test (Devtest) set. In 2002, NIST re-
leased two test sets, in February (Feb02) and September
(Sep02), respectively. Statistics of the three test sets are
summarized in Table 2. We report coreference re-
sults on the true mentions of the three test sets.
TestSet #-docs #-words #-mentions #-entities
Devtest 90 50426 7470 2891
Feb02 97 52677 7665 3104
Sep02 186 69649 10577 4355
Table 2: Statistics of three test sets.
For the mention-pair model, training events are gen-
erated for all compatible mention-pairs, which results
in about 909K events, about 150K of which are posi-
tive examples. The full mention-pair model uses about
171K features; most are conjunction features. For the
entity-mention model, events are generated by walking
through the Bell tree. Only events on the true path (i.e.,
positive examples) and branches emitting from a node
on the true path to a node off the true path (i.e.,
negative examples) are generated. For example, in Fig-
ure 1, suppose that the path (a)-(b2)-(c4) is the truth;
then the positive training examples are the starting event
from (a) to (b2) and the linking event from (b2) to (c4),
while the negative examples are the linking events from (a) to
(b1) and from (b2) to (c3), and the starting event from (b2) to
(c5). This scheme generates about 322K events, out of
which about 18K are positive training examples. The
full entity-mention model has about 8.4K features, due
to the smaller number of conjunction features and training ex-
amples.
Coreference results on the true mentions of the De-
vtest, Feb02, and Sep02 test sets are tabulated in Ta-
ble 3. These numbers are obtained with a fixed search
beam of 20 and a pruning threshold $r = 0.001$ (widening
the search beam or using a smaller pruning threshold
did not change the results significantly).
The mention-pair model in most cases performs bet-
ter than the entity-mention model by both the ACE-value
and the ECM-F measure, although none of the differences
is statistically significant (pair-wise t-test) at p-value
0.05. Note, however, that the mention-pair model uses
20 times more features than the entity-mention model. We
also observed that, because the score between the in-
focus entity and the active mention is computed by (9)
in the mention-pair model, the mention-pair model sometimes
mistakenly places a male pronoun and a female pronoun
into the same entity, while the same mistake is avoided
in the entity-mention model. Using the canonical men-
tions when computing some features (e.g., "spell") in
the entity-mention model is probably not optimal, and
it is an area that needs further research.
When the same mention-pair model is used to score
the ACE 2003 evaluation data, an ACE-value of 73.4% is
obtained on the system1 mentions. After retraining with
Chinese and Arabic data (much less training data than
English), the system got 58.8% and 54.5% ACE-value
on the system mentions of the ACE 2003 evaluation data
for Chinese and Arabic, respectively. The results for
all three languages are among the top-tier submission
systems. Details of the mention detection and corefer-
ence system can be found in (Florian et al., 2004).
Since the mention-pair model performs better, subsequent
analyses are done with the mention-pair model only.
5.2.1 Feature Impact
To see how each category of features affects the per-
formance, we start with the aforementioned mention-
pair model, incrementally remove each feature cate-
gory, retrain the system and test it on the Devtest set.
The result is summarized in Table 4. The last column
lists the number of features. The second row is the full
mention-pair model; the third through seventh rows cor-
respond to models obtained by removing, respectively, the
syntactic features (i.e., POS tags and apposition features),
the count features, the distance features, the mention type
and level information, and the pair of mention-spelling
features. If a basic feature is removed, conjunction
features using that basic feature are also removed. It is
striking that the smallest system, consisting of only 39
features (string and substring match, acronym, edit distance
and number of different capitalized words), can get as much
as an 86.0% ACE-value. Table 4 shows clearly that these
lexical features and the distance features are the most
important. Sometimes the ACE-value increases after
removing a set of features, but the ECM-F measure nicely
tracks the trend that the more features there are, the better
the performance. This is because the ACE-value
1System mentions are output from a mention detection
system.
[Figure 2: Performance vs. log start penalty — ACE-value and ECM-F (y-axis, 0.65 to 0.9) plotted against log α (x-axis, −2.5 to 0).]
is a weighted metric. A small fluctuation of NAME
entities will impact the ACE-value more than many
NOMINAL or PRONOUN entities.
Model ACE-val(%) ECM-F(%) #-features
Full 89.8 73.2 (±2.9) 171K
-syntax 89.0 72.6 (±2.5) 71K
-count 89.4 72.0 (±3.3) 70K
-dist 86.7 *66.2 (±3.9) 24K
-type/level 86.8 65.7 (±2.2) 5.4K
-spell 86.0 64.4 (±1.9) 39
Table 4: Impact of feature categories. Numbers after ±
are the standard deviations. * indicates that the result
is significantly different (pair-wise t-test) from the line
above at $p = 0.05$.
5.2.2 Effect of Start Penalty
As discussed in Section 3.1, the start penalty $\alpha$ can
be used to balance entity misses and false alarms. To
see this effect, we decode the Devtest set with varying
start penalties; the result is depicted in Figure 2. The
ACE-value and ECM-F track each other fairly well.
Both achieve their optimum when $\log \alpha \approx -0.8$.
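Such a tuning curve can be generated with a simple sweep; decode_devtest below is a hypothetical stand-in for running the decoder at a given start penalty, and ecm_f is the sketch defined earlier:

```python
# Hypothetical tuning sweep behind Figure 2: decode_devtest(alpha)
# is an assumed stand-in that decodes the Devtest set at start
# penalty alpha and returns system entities.
import math

def sweep(decode_devtest, reference):
    for log_alpha in [x / 4 for x in range(-10, 1)]:  # -2.5 ... 0
        alpha = math.exp(log_alpha)
        score = ecm_f(reference, decode_devtest(alpha))
        print(f"log alpha = {log_alpha:+.2f}  ECM-F = {score:.3f}")
```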
5.3 Experiments on the MUC data
To see how the proposed algorithm works on the MUC
data, we test our algorithm on the MUC6 data. To min-
imize the change to the coreference system, we first
map the MUC data into the ACE style. The original
MUC coreference data does not have the entity types (e.g.,
"ORGANIZATION", "LOCATION") required in
the ACE style. Some of the entity types can be recovered
from the corresponding named-entity annotations. The
recovered named-entity label is propagated to all men-
tions belonging to the same entity. There are 504 out of
2072 mentions of the MUC6 formal test set and 695
out of 2141 mentions of the MUC6 dry-run test set
that cannot be assigned labels by this procedure. A
Devtest Feb02 Sep02
Model ACE-val(%) ECM-F(%) ACE-val(%) ECM-F(%) ACE-val(%) ECM-F(%)
MP 89.8 73.2 (±2.9) 90.0 73.1 (±4.0) 88.0 73.1 (±6.8)
EM 89.9 71.7 (±2.4) 88.2 70.8 (±3.9) 87.6 72.4 (±6.2)
Table 3: Coreference results on true mentions: MP – mention-pair model; EM – entity-mention model; ACE-val:
ACE-value; ECM-F: entity-constrained mention F-measure. MP uses 171K features while EM uses only 8.4K
features. None of the ECM-F differences between MP and EM is statistically significant at $p = 0.05$.
generic type "UNKNOWN" is assigned to these men-
tions. Mentions that can be found in the named-entity
annotation are assumed to have the ACE mention level
"NAME"; all mentions other than English pro-
nouns are assigned the level "NOMINAL."
After the MUC data is mapped into the ACE-style,
the same set of feature templates is used to train
a coreference system. Two coreference systems are
trained on the MUC6 data: one trained with 30 dry-run
test documents (henceforth “MUC6-small”); the other
trained with 191 “dryrun-train” documents that have
both coreference and named-entity annotations (hence-
forth “MUC6-big”) in the latest LDC release.
To use the official MUC scorer, we convert the out-
put of the ACE-style coreference system back into the
MUC format. Since MUC does not require entity label
and level, the conversion from ACE to MUC is “loss-
less.”
Table 5 tabulates the test results on the true mentions
of the MUC6 formal test set. The numbers in the ta-
ble represent the optimal operating point determined by
ECM-F. The MUC F-measure cannot be used to determine
the operating point since it inherently favors systems that
output a smaller number of entities (e.g., putting all
mentions of the MUC6 formal test set into one entity
yields a 100% link recall and 78.9% link precision, which
gives an 88.2% F-measure). The MUC6-small system
compares favorably with the similar experiment in
Harabagiu et al. (2001), in which an 81.9% F-measure is
reported. When measured by the ECM-F measure, the
MUC6-small system performs at the same level as the ACE
system, while the MUC6-big system performs better than
the ACE system. The results show that the algorithm works
well on the MUC6 data, despite some information being
lost in the conversion from the MUC format to the ACE
format.
System MUC F-measure ECM-F
MUC6-small 83.9% 72.1%
MUC6-big 85.7% 76.8%
Table 5: Results on the MUC6 formal test set.
6 Related Work
There exists a large body of literature on the topic of
coreference resolution. We will compare this study
with some relevant work using machine learning or sta-
tistical methods only.
Soon et al. (2001) uses a decision tree model for
coreference resolution on the MUC6 and MUC7 data.
Leaves of the decision tree are labeled with “link” or
“not-link” in training. At test time, the system checks
a mention against all its preceding mentions, and the
first one labeled with “link” is picked as the antecedent.
Their work is later enhanced by (Ng and Cardie, 2002)
in several aspects: first, the decision tree returns scores
instead of a hard decision of "link" or "not-link", so that
Ng and Cardie (2002) are able to pick the "best" candi-
date on the left, as opposed to the first one in (Soon et al.,
2001); second, Ng and Cardie (2002) expand the fea-
ture sets of (Soon et al., 2001). The model in (Yang et
al., 2003) expands the conditioning scope by including
a competing candidate. Neither (Soon et al., 2001) nor
(Ng and Cardie, 2002) searches for the globally optimal
entity configuration, in that they make locally independent decisions
during search. In contrast, our decoder always searches
for the best result ranked by the cumulative score (sub-
ject to pruning), and subsequent decisions depend on
earlier ones.
Recently, McCallum and Wellner (2003) proposed
to use graphical models for computing probabilities of
entities. The model is appealing in that it can poten-
tially overcome the limitation of the mention-pair model,
in which dependency among mentions other than the two
in question is ignored. However, the models in (McCal-
lum and Wellner, 2003) compute directly the probabil-
ity of an entity configuration conditioned on mentions,
and it is not clear how the models can be factored to
do the incremental search, as it is impractical to enu-
merate all possible entities even for documents with a
moderate number of mentions. The Bell tree represen-
tation proposed in this paper, however, provides us with
a naturally incremental framework for coreference res-
olution.
The maximum entropy method has been used in coref-
erence resolution before. For example, Kehler (1997)
uses a mention-pair maximum entropy model, and two
methods are proposed to compute entity scores based
on the mention-pair model: 1) a distribution over en-
tity space is deduced; 2) the most recent mention of an
entity, together with the candidate mention, is used to
compute the entity-mention score. In contrast, in our
mention pair model, an entity-mention pair is scored
by taking the maximum score among possible mention
pairs. Our entity-mention model eliminates the need to
synthesize an entity-mention score from mention-pair
scores. Morton (2000) also uses a maximum entropy
mention-pair model, and a special “dummy” mention
is used to model the event of starting a new entity.
Features involving the dummy mention are essentially
computed with the single (normal) mention, and there-
fore the starting model is weak. In our model, the start-
ing model is obtained by “complementing” the linking
scores. The advantage is that we do not need to train
a starting model. To compensate for the model inaccuracy,
we introduce a "start penalty" to balance the linking
and starting scores.
To our knowledge, this paper is the first to use the Bell
tree to represent the search space of the coreference
resolution problem.
7 Conclusion
We propose to use the Bell tree to represent the pro-
cess of forming entities from mentions. The Bell tree
represents the search space of the coreference reso-
lution problem. We studied two maximum entropy
models, namely the mention-pair model and the entity-
mention model, both of which can be used to score
entity hypotheses. A beam search algorithm is used
to search for the best entity result. State-of-the-art perfor-
mance has been achieved on the ACE coreference data
across three languages.
Acknowledgments
This work was partially supported by the Defense Ad-
vanced Research Projects Agency and monitored by
SPAWAR under contract No. N66001-99-2-8916. The
views and findings contained in this material are those
of the authors and do not necessarily reflect the position
or policy of the Government, and no official endorse-
ment should be inferred. We would also like to thank
the anonymous reviewers for their suggestions on improving
the paper.
References
E.T. Bell. 1934. Exponential numbers. Amer. Math.
Monthly, pages 411–419.
Adam L. Berger, Stephen A. Della Pietra, and Vincent
J. Della Pietra. 1996. A maximum entropy approach
to natural language processing. Computational Lin-
guistics, 22(1):39–71, March.
R Florian, H Hassan, A Ittycheriah, H Jing, N Kamb-
hatla, X Luo, N Nicolov, and S Roukos. 2004. A
statistical model for multilingual entity detection and
tracking. In Daniel Marcu Susan Dumais and Salim
Roukos, editors, HLT-NAACL 2004: Main Proceed-
ings, pages 1–8, Boston, Massachusetts, USA, May
2 - May 7. Association for Computational Linguis-
tics.
Niyu Ge, John Hale, and Eugene Charniak. 1998. A
statistical approach to anaphora resolution. In Proc.
of the sixth Workshop on Very Large Corpora.
Sanda M. Harabagiu, Razvan C. Bunescu, and Steven J.
Maiorano. 2001. Text and knowledge mining for
coreference resolution. In Proc. of NAACL.
J. Hobbs. 1976. Pronoun resolution. Technical Report
TR76-1, Dept. of Computer Science, CUNY.
A. Ittycheriah, L. Lita, N. Kambhatla, N. Nicolov,
S. Roukos, and M. Stys. 2003. Identifying and
tracking entity mentions in a maximum entropy
framework. In HLT-NAACL 2003: Short Papers,
May 27 - June 1.
Andrew Kehler. 1997. Probabilistic coreference in in-
formation extraction. In Proc. of EMNLP.
Andrew McCallum and Ben Wellner. 2003. To-
ward conditional models of identity uncertainty with
application to proper noun coreference. In IJCAI
Workshop on Information Integration on the Web.
R. Mitkov. 1998. Robust pronoun resolution with lim-
ited knowledge. In Procs. of the 17th International
Conference on Computational Linguistics, pages
869–875.
Thomas S. Morton. 2000. Coreference for NLP appli-
cations. In Proceedings of the 38th Annual Meeting
of the Association for Computational Linguistics.
MUC-6. 1995. Proceedings of the Sixth Message
Understanding Conference (MUC-6), San Francisco,
CA. Morgan Kaufmann.
Vincent Ng and Claire Cardie. 2002. Improving ma-
chine learning approaches to coreference resolution.
In Proc. of ACL, pages 104–111.
NIST. 2003. The ACE evaluation plan.
www.nist.gov/speech/tests/ace/index.htm.
Adwait Ratnaparkhi. 1997. A Linear Observed Time
Statistical Parser Based on Maximum Entropy Mod-
els. In Second Conference on Empirical Methods in
Natural Language Processing, pages 1 – 10.
Wee Meng Soon, Hwee Tou Ng, and Chung Yong Lim.
2001. A machine learning approach to coreference
resolution of noun phrases. Computational Linguis-
tics, 27(4):521–544.
M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and
L. Hirschman. 1995. A model-theoretic coreference
scoring scheme. In Proc. of MUC6, pages 45–52.
Xiaofeng Yang, Guodong Zhou, Jian Su, and
Chew Lim Tan. 2003. Coreference resolution us-
ing competition learning approach. In Proc. of the
41st ACL.
