Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 25–32, Vancouver, October 2005. ©2005 Association for Computational Linguistics
On Coreference Resolution Performance Metrics
Xiaoqiang Luo
1101 Kitchawan Road, Room 23-121
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, U.S.A.
xiaoluo@us.ibm.com
Abstract
The paper proposes a Constrained Entity-
Alignment F-Measure (CEAF) for evaluating
coreference resolution. The metric is com-
puted by aligning reference and system entities
(or coreference chains) with the constraint that
a system (reference) entity is aligned with at
most one reference (system) entity. We show
that the best alignment is a maximum bipartite
matching problem which can be solved by the
Kuhn-Munkres algorithm. Comparative exper-
iments are conducted to show that the widely-
known MUC F-measure has serious flaws in
evaluating a coreference system. The proposed
metric is also compared with the ACE-Value,
the official evaluation metric in the Automatic
Content Extraction (ACE) task, and we con-
clude that the proposed metric possesses some
properties such as symmetry and better inter-
pretability missing in the ACE-Value.
1 Introduction
A working definition of coreference resolution is parti-
tioning the noun phrases we are interested in into equiv-
alence classes, each of which refers to a physical entity.
We adopt the terminologies used in the Automatic Con-
tent Extraction (ACE) task (NIST, 2003a) and call each
individual phrase a mention and equivalence class an en-
tity. For example, in the following text segment,
(1): “The American Medical Association
voted yesterday to install the heir apparent as
its president-elect, rejecting a strong, upstart
challenge by a district doctor who argued that
the nation’s largest physicians’ group needs
stronger ethics and new leadership.”
mentions are underlined, “American Medical Associa-
tion”, “its” and “group” refer to the same organization
(object) and they form an entity. Similarly, “the heir ap-
parent” and “president-elect” refer to the same person and
they form another entity. It is worth pointing out that the
entity definition here is different from the one used in the
Message Understanding Conference (MUC) task (MUC,
1995; MUC, 1998) – ACE entity is called coreference
chain or equivalence class in MUC, and ACE mention is
called entity in MUC.
An important problem in coreference resolution is how
to evaluate a system’s performance. A good performance
metric should have the following two properties:
• Discriminativity: This refers to the ability to differ-
entiate a good system from a bad one. While this
criterion sounds trivial, not all performance metrics
used in the past possess this property.
• Interpretability: A good metric should be easy to in-
terpret. That is, there should be an intuitive sense of
how good a system is when a metric suggests that a
certain percentage of coreference results are correct.
For example, when a metric reports 95% or above
correct for a system, we would expect that the vast
majority of mentions are in the right entities or
coreference chains.
A widely-used metric is the link-based F-measure (Vi-
lain et al., 1995) adopted in the MUC task. It is computed
by first counting the number of common links between
the reference (or “truth”) and the system output (or “re-
sponse”); the link precision is the number of common
links divided by the number of links in the system out-
put, and the link recall is the number of common links
divided by the number of links in the reference. There
are known problems associated with the link-based F-
measure. First, it ignores single-mention entities since
no link can be found in these entities; second, and more
importantly, it fails to distinguish system outputs with
different qualities: the link-based F-measure intrinsically
favors systems producing fewer entities, and may result
in higher F-measures for worse systems. We will revisit
these issues in Section 3.
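The link-based computation just described can be sketched as follows (a minimal illustration rather than the official MUC scorer; the function name is ours, and entities are given as Python sets of mention identifiers). The partition-based counting below is equivalent to counting common links, since an entity of n mentions carries n−1 links:

```python
def muc_f(truth, response):
    """MUC link-based F-measure (Vilain et al., 1995).  Recall counts,
    for each reference entity, the links preserved after partitioning
    it by the response; precision swaps the roles of the two sides."""
    def link_recall(key, sys):
        num = den = 0
        for k in key:
            # distinct partitions of k induced by sys; a mention absent
            # from every sys entity forms its own singleton partition
            parts = {next((j for j, s in enumerate(sys) if m in s),
                          ('singleton', m)) for m in k}
            num += len(k) - len(parts)
            den += len(k) - 1
        return num / den if den else 0.0
    r = link_recall(truth, response)
    p = link_recall(response, truth)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Note that single-mention entities contribute zero to both numerator and denominator here, which is exactly the first shortcoming noted above.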
To counter these shortcomings, Bagga and Baldwin
(1998) proposed a B-cubed metric, which first computes
a precision and recall for each individual mention, and
then takes the weighted sum of these individual preci-
sions and recalls as the final metric. While the B-cubed
metric fixes some of the shortcomings of the MUC F-
measure, it has its own problems: for example, the men-
tion precision/recall is computed by comparing entities
containing the mention and therefore an entity can be
used more than once. The implication of this drawback
will be revisited in Section 3.
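The per-mention computation can be sketched as follows (uniform mention weights are assumed for simplicity; Bagga and Baldwin's formulation allows other weightings, and the function name is ours):

```python
def b_cubed(truth, response):
    """B-cubed: average, over mentions, the per-mention precision
    |R_m & S_m|/|S_m| and recall |R_m & S_m|/|R_m|, where R_m and S_m
    are the reference and system entities containing mention m."""
    mentions = set().union(*truth)
    def containing(m, entities):
        # a mention absent from every entity counts as a singleton
        return next((e for e in entities if m in e), {m})
    p = r = 0.0
    for m in mentions:
        rm, sm = containing(m, truth), containing(m, response)
        overlap = len(rm & sm)
        p += overlap / len(sm)
        r += overlap / len(rm)
    p, r = p / len(mentions), r / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f
```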
In the ACE task, a value-based metric called ACE-
value (NIST, 2003b) is used. The ACE-value is com-
puted by counting the number of false alarms, the
number of misses, and the number of mistaken entities. Each
error is associated with a cost factor that depends on
things such as entity type (e.g., “LOCATION”, “PER-
SON”), and mention level (e.g., “NAME,” “NOMINAL,”
and “PRONOUN”). The total cost is the sum of the three
costs, which is then normalized against the cost of a nom-
inal system that does not output any entity. The ACE-
value is finally computed by subtracting the normalized
cost from a6 . A perfect coreference system will get a
a6a8a7a9a7
a4 ACE-value while a system outputs no entities will
get a a7 ACE-value. A system outputting many erroneous
entities could even get negative ACE-value. The ACE-
value is computed by aligning entities and thus avoids
the problems of the MUC F-measure. The ACE-value is,
however, hard to interpret: a system with a1 a7 a4 ACE-value
does not mean that a1 a7
a4 of system entities or mentions are
correct, but that the cost of the system, relative to the one
outputting no entity, is a6a10a7 a4 .
In this paper, we aim to develop an evaluation metric
that measures the quality of a coreference system (that is,
an intuitively better system gets a higher score than a
worse one) and that is easy to interpret. To this
end, we observe that the task of a coreference system is to
recognize entities, and we propose a metric called the Constrained
Entity-Aligned F-Measure (CEAF). At the core of the metric is
the optimal one-to-one map between subsets of reference
and system entities: system entities and reference entities
are aligned by maximizing the total entity similarity un-
der the constraint that a reference entity is aligned with
at most one system entity, and vice versa. Once the to-
tal similarity is defined, it is straightforward to compute
recall, precision and F-measure. The constraint imposed
in the entity alignment makes it impossible to “cheat” the
metric: a system outputting too many entities will be pe-
nalized in precision, while a system outputting too few
entities will be penalized in recall. It also has the prop-
erty that a perfect system gets an F-measure of 1, while a
system outputting no entity or no common mentions gets
an F-measure of 0. The proposed CEAF has a clear mean-
ing: for mention-based CEAF, it reflects the percentage
of mentions that are in the correct entities; for entity-
based CEAF, it reflects the percentage of correctly recog-
nized entities.
The rest of the paper is organized as follows. In Sec-
tion 2, the Constrained Entity-Alignment F-Measure is
presented in detail: the constrained entity alignment can
be represented by a bipartite graph and the optimal
alignment can be found by the Kuhn-Munkres algo-
rithm (Kuhn, 1955; Munkres, 1957). We also present
two entity-pair similarity measures that can be used in
CEAF: one is the absolute number of common mentions
between two entities, and the other is a “local” mention F-
measure between two entities. The two measures lead to
the mention-based and entity-based CEAF, respectively.
In Section 3, we compare the proposed metric with the
MUC link-based metric and ACE-value on both artificial
and real data, and point out the problems of the MUC
F-measure.
2 Constrained Entity-Alignment
F-Measure
Some notations are needed before we present the pro-
posed metric and the algorithm to compute the metric.
Let the reference entities in a document $d$ be

$$\mathcal{R}(d) = \{R_i : i = 1, 2, \ldots, |\mathcal{R}(d)|\},$$

and the system entities be

$$\mathcal{S}(d) = \{S_j : j = 1, 2, \ldots, |\mathcal{S}(d)|\}.$$

To simplify typesetting, we will omit the dependency on
$d$ when it is clear from context, and write $\mathcal{R}(d)$ as $\mathcal{R}$ and
$\mathcal{S}(d)$ as $\mathcal{S}$. Let

$$m = \min\{|\mathcal{R}|, |\mathcal{S}|\}, \qquad M = \max\{|\mathcal{R}|, |\mathcal{S}|\},$$

and let $\mathcal{R}_m \subseteq \mathcal{R}$ and $\mathcal{S}_m \subseteq \mathcal{S}$ be any subsets with $m$ enti-
ties. That is, $|\mathcal{R}_m| = m$ and $|\mathcal{S}_m| = m$. Let $G(\mathcal{R}_m, \mathcal{S}_m)$
be the set of one-to-one entity maps from $\mathcal{R}_m$ to $\mathcal{S}_m$, and
$G_m$ be the set of all possible one-to-one maps between
the size-$m$ subsets of $\mathcal{R}$ and $\mathcal{S}$. Or

$$G(\mathcal{R}_m, \mathcal{S}_m) = \{g : \mathcal{R}_m \mapsto \mathcal{S}_m\},$$
$$G_m = \bigcup_{\mathcal{R}_m, \mathcal{S}_m} G(\mathcal{R}_m, \mathcal{S}_m).$$

The requirement of a one-to-one map means that for any
$g \in G(\mathcal{R}_m, \mathcal{S}_m)$ and any $R_i \in \mathcal{R}_m$ and $R_j \in \mathcal{R}_m$,
we have that $R_i \neq R_j$ implies $g(R_i) \neq g(R_j)$, and
$g(R_i) \neq g(R_j)$ implies $R_i \neq R_j$. Clearly, there are $m!$
one-to-one maps from $\mathcal{R}_m$ to $\mathcal{S}_m$ (or $|G(\mathcal{R}_m, \mathcal{S}_m)| = m!$), and
$|G_m| = \binom{M}{m}\, m!$.
Let $\phi(R, S)$ be a "similarity" measure between two en-
tities $R$ and $S$. $\phi(R, S)$ takes a non-negative value: a zero
value means that $R$ and $S$ have nothing in common. For
example, $\phi(R, S)$ could be the number of common men-
tions shared by $R$ and $S$, and $\phi(R, R)$ the number of men-
tions in entity $R$.
For any $g \in G_m$, the total similarity $\Phi(g)$ for a map $g$
is the sum of similarities between the aligned entity pairs:
$\Phi(g) = \sum_{R \in \mathcal{R}_m} \phi(R, g(R))$. Given a document $d$, and
its reference entities $\mathcal{R}$ and system entities $\mathcal{S}$, we can find
the best alignment maximizing the total similarity:

$$g^* = \arg\max_{g \in G_m} \Phi(g)
     = \arg\max_{g \in G_m} \sum_{R \in \mathcal{R}_m} \phi(R, g(R)). \qquad (1)$$

Let $\mathcal{R}^*_m$ and $\mathcal{S}^*_m = g^*(\mathcal{R}^*_m)$ denote the reference and
system entity subsets on which $g^*$ is attained, respectively.
Then the maximum total similarity is

$$\Phi(g^*) = \sum_{R \in \mathcal{R}^*_m} \phi(R, g^*(R)). \qquad (2)$$

If we insist that $\phi(R, S) = 0$ whenever $R$ or $S$ is
empty, then the non-negativity requirement of $\phi(R, S)$
makes it unnecessary to consider the possibility of map-
ping one entity to an empty entity, since the one-to-one
map maximizing $\Phi(g)$ must be in $G_m$.
Since we can compute the entity self-similarities
$\phi(R_i, R_i)$ and $\phi(S_j, S_j)$ for any $R_i \in \mathcal{R}$ and $S_j \in \mathcal{S}$ (i.e.,
using the identity map), we are now ready to define the
precision, recall and F-measure as follows:

$$P = \frac{\Phi(g^*)}{\sum_j \phi(S_j, S_j)} \qquad (3)$$

$$R = \frac{\Phi(g^*)}{\sum_i \phi(R_i, R_i)} \qquad (4)$$

$$F = \frac{2PR}{P + R}. \qquad (5)$$

The optimal alignment $g^*$ involves only $m =
\min\{|\mathcal{R}|, |\mathcal{S}|\}$ reference and system entities, and entities
not aligned do not get credit. Thus the F-measure (5) pe-
nalizes a coreference system that proposes too many (i.e.,
lower precision) or too few entities (i.e., lower recall),
which is a desired property.
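A direct, naive realization of (1)-(5) might look like the following (a sketch, with names of our choosing; it enumerates the one-to-one maps by brute force, which Section 2.1 replaces with the Kuhn-Munkres algorithm, and it takes the similarity measure as a parameter since concrete choices are discussed in Section 2.2):

```python
from itertools import permutations

def ceaf(truth, response, phi):
    """CEAF per equations (1)-(5): maximize the total similarity Phi(g)
    over one-to-one entity maps g, then normalize by self-similarities.
    phi is assumed symmetric, as the measures in Section 2.2 are."""
    small, big = sorted([truth, response], key=len)
    # permutations(big, len(small)) enumerates every one-to-one map
    # from the smaller entity set into the larger one
    best = max(sum(phi(a, b) for a, b in zip(small, perm))
               for perm in permutations(big, len(small)))
    p = best / sum(phi(s, s) for s in response)  # eq. (3)
    r = best / sum(phi(t, t) for t in truth)     # eq. (4)
    f = 2 * p * r / (p + r) if p + r else 0.0    # eq. (5)
    return r, p, f

# one concrete similarity (anticipating Section 2.2): common-mention count
phi_mention = lambda R, S: len(R & S)
```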
In the above discussion, it is assumed that the sim-
ilarity measure $\phi(R, S)$ is computed for all entity pairs
$(R, S)$. In practice, computation of $\phi(R, S)$ can be
avoided if it is clear that $R$ and $S$ have nothing in common
(e.g., if no mention in $R$ and $S$ overlaps, then $\phi(R, S) =
0$). These entity pairs are not linked, and they will not
be considered when searching for the optimal alignment.
Consequently, the optimal alignment could involve fewer
than $m$ reference and system entities. This can speed up
the F-measure computation considerably when the ma-
jority of entity pairs have zero similarity. Nevertheless,
summing over $m$ entity pairs in the general formula (2)
does not change the optimal total similarity between $\mathcal{R}$
and $\mathcal{S}$, and hence the F-measure.
In formulae (3)-(5), there is only one document in the
test corpus. Extension to a corpus with multiple test doc-
uments is trivial: just accumulate statistics on a per-
document basis for both the denominators and numerators in
(3) and (4), and take the ratio of the two.
So far, we have tacitly kept abstract the similarity mea-
sure $\phi(R, S)$ for an entity pair $R$ and $S$. We will defer the
discussion of this metric to Section 2.2. Instead, we first
present the algorithm computing the F-measure (3)-(5).
2.1 Computing Optimal Alignment and F-measure
A naive implementation of (1) would enumerate all the
possible one-to-one maps (or alignments) between size-
$m$ (recall that $m = \min\{|\mathcal{R}|, |\mathcal{S}|\}$) subsets of $\mathcal{R}$ and
size-$m$ subsets of $\mathcal{S}$, and find the best alignment max-
imizing the similarity. Since this requires computing
the similarities between $mM$ entity pairs and there are
$|G_m| = \binom{M}{m}\, m!$ possible one-to-one maps, the complex-
ity of this implementation is $O(mM + \binom{M}{m}\, m!)$. This
is not satisfactory even for a document with a moderate
number of entities: it entails about 3.6 million opera-
tions for $M = m = 10$, a document with only 10 refer-
ence and 10 system entities.
Fortunately, the entity alignment problem under the
constraint that an entity can be aligned at most once is
the classical maximum bipartite matching problem, and
there exists an algorithm (Kuhn, 1955; Munkres, 1957)
(henceforth the Kuhn-Munkres algorithm) that can find the
optimal solution in polynomial time. Casting the entity
alignment problem as maximum bipartite matching is
trivial: each entity in $\mathcal{R}$ and $\mathcal{S}$ is a vertex, and the node
pair $(R, S)$, where $R \in \mathcal{R}$, $S \in \mathcal{S}$, is connected by an
edge with weight $\phi(R, S)$. Thus the problem (1) is
exactly maximum bipartite matching.
With the Kuhn-Munkres algorithm, the procedure to
compute the F-measure (5) can be described as Algo-
rithm 1.

Algorithm 1 Computing the F-measure (5).
Input: reference entities $\mathcal{R}$; system entities $\mathcal{S}$
Output: optimal alignment $g^*$; F-measure (5).
1: Initialize: $g^* = \emptyset$; $\Phi(g^*) = 0$.
2: For $i = 1$ to $|\mathcal{R}|$
3:   For $j = 1$ to $|\mathcal{S}|$
4:     Compute $\phi(R_i, S_j)$.
5: $[g^*, \Phi(g^*)] = \mathrm{KM}(\{\phi(R_i, S_j)\})$.
6: $\Phi(\mathcal{R}) = \sum_i \phi(R_i, R_i)$; $\Phi(\mathcal{S}) = \sum_j \phi(S_j, S_j)$.
7: $R = \Phi(g^*)/\Phi(\mathcal{R})$; $P = \Phi(g^*)/\Phi(\mathcal{S})$; $F = \frac{2PR}{P+R}$.
8: Return $g^*$ and $F$.
The input to the algorithm is the reference entities $\mathcal{R}$ and
system entities $\mathcal{S}$. The algorithm returns the best one-to-
one map $g^*$ and the F-measure in equation (5). The loop from
line 2 to 4 computes the similarity between all the pos-
sible reference and system entity pairs. The complexity
of this loop is $O(mM)$. Line 5 calls the Kuhn-Munkres
algorithm, which takes as input the entity-pair scores
$\{\phi(R_i, S_j)\}$ and outputs the best map $g^*$ and the corre-
sponding total similarity $\Phi(g^*)$. The worst-case (i.e.,
when all entries in $\{\phi(R_i, S_j)\}$ are non-zero) complexity
of the Kuhn-Munkres algorithm is $O(Mm^2 \log m)$. Line 6 com-
putes the "self-similarities" $\Phi(\mathcal{R})$ and $\Phi(\mathcal{S})$ needed in the F-
measure computation at line 7.
The core of the F-measure computation is the Kuhn-
Munkres algorithm at line 5. The algorithm was originally
developed by Kuhn (1955) and Munkres (1957) to solve
the matching (a.k.a. assignment) problem for square ma-
trices. Since then, it has been extended to rectangu-
lar matrices (Bourgeois and Lassalle, 1971) and paral-
lelized (Balas et al., 1991). A recent review can be found
in (Gupta and Ying, 1999), which also details tech-
niques for fast implementation. A short description of the
algorithm is included in the Appendix for the sake of com-
pleteness.
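In practice, line 5 of Algorithm 1 need not be implemented from scratch: for instance, SciPy's linear_sum_assignment routine solves the same rectangular assignment problem (its implementation is a Jonker-Volgenant-style variant rather than classical Munkres, but the optimum is identical). A sketch of Algorithm 1 on top of it, assuming entities are Python sets and with the function name ceaf_km being ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf_km(truth, response, phi):
    """Algorithm 1: build the similarity matrix (lines 2-4), find the
    optimal alignment via an assignment solver (line 5), then compute
    recall, precision and F (lines 6-7)."""
    sim = np.array([[phi(r, s) for s in response] for r in truth],
                   dtype=float)
    rows, cols = linear_sum_assignment(sim, maximize=True)
    total = sim[rows, cols].sum()                 # Phi(g*)
    p = total / sum(phi(s, s) for s in response)  # eq. (3)
    r = total / sum(phi(t, t) for t in truth)     # eq. (4)
    f = 2 * p * r / (p + r) if p + r else 0.0     # eq. (5)
    return r, p, f
```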
2.2 Entity Similarity Metric
In this section we consider the entity similarity metric
$\phi(R, S)$ defined on an entity pair $(R, S)$. It is desirable
that $\phi(R, S)$ be large when $R$ and $S$ are "close" and small
when $R$ and $S$ are very different. Some straightforward
choices could be

$$\phi_1(R, S) = \begin{cases} 1, & \text{if } R = S \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

$$\phi_2(R, S) = \begin{cases} 1, & \text{if } R \cap S \neq \emptyset \\ 0, & \text{otherwise.} \end{cases} \qquad (7)$$

(6) insists that two entities are the same if all their mentions
are the same, while (7) goes to the other extreme: two
entities are the same if they share at least one common
mention.

(6) does not offer a good granularity of similarity: for
example, if $R = \{a, b, c\}$, one system response
is $S_1 = \{a, b\}$, and the other system response is $S_2 =
\{a\}$, then clearly $S_1$ is more similar to $R$ than $S_2$, yet
$\phi_1(R, S_1) = \phi_1(R, S_2) = 0$. For the same reason, (7)
lacks the desired discriminativity as well.

From the above argument, it is clear that we want
a metric that can measure the degree to which two
entities are similar, not a binary decision. One natural
choice is measuring how many common mentions two
entities share, and this can be measured by the absolute
number or the relative number:

$$\phi_3(R, S) = |R \cap S| \qquad (8)$$

$$\phi_4(R, S) = \frac{2\,|R \cap S|}{|R| + |S|}. \qquad (9)$$
Metric (8) simply counts the number of common men-
tions shared by $R$ and $S$, while (9) is the mention F-
measure between $R$ and $S$, a relative number measuring
how similar $R$ and $S$ are. For the abovementioned exam-
ple,

$$\phi_3(R, S_1) = \phi_3(\{a, b, c\}, \{a, b\}) = 2$$
$$\phi_3(R, S_2) = \phi_3(\{a, b, c\}, \{a\}) = 1$$
$$\phi_4(R, S_1) = \phi_4(\{a, b, c\}, \{a, b\}) = 0.8$$
$$\phi_4(R, S_2) = \phi_4(\{a, b, c\}, \{a\}) = 0.5,$$

thus both metrics give the desired ranking:
$\phi_3(R, S_1) > \phi_3(R, S_2)$ and $\phi_4(R, S_1) > \phi_4(R, S_2)$.
If $\phi_3(\cdot,\cdot)$ is adopted in Algorithm 1, $\Phi(g^*)$ is the total num-
ber of common mentions under the best
one-to-one map $g^*$, while the denominators of (3) and (4)
are the number of system mentions and the number
of reference mentions, respectively. The F-measure in (5)
can then be interpreted as the ratio of mentions that are in the
"right" entities. Similarly, if $\phi_4(\cdot,\cdot)$ is adopted in Algo-
rithm 1, the denominators of (3) and (4) are the number
of system entities and the number of reference entities,
respectively, and the F-measure in (5) can be understood
as the ratio of correctly recognized entities. Therefore, (5) is called
mention-based CEAF when (8) is used and entity-based CEAF when (9)
is used.
$\phi_3(\cdot,\cdot)$ and $\phi_4(\cdot,\cdot)$ are two reasonable entity similarity
measures, but by no means the only choices. At the men-
tion level, partial credit could be assigned to two men-
tions with different but overlapping spans; or, when men-
tion type is available, weights defined on the type confu-
sion matrix can be incorporated. At the entity level, entity
attributes, if available, can be weighted in the similarity
measure as well. For example, ACE data defines three
entity classes: NAME, NOMINAL and PRONOUN. Dif-
ferent weights can be assigned to the three classes.
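As an illustration of such a weighted similarity (a hypothetical sketch: the weight values below are arbitrary assumptions for illustration, not the ACE cost factors, and the name phi_weighted is ours):

```python
# Hypothetical class-weighted entity similarity: each mention carries a
# class label, and a matched mention contributes its class weight
# instead of a flat count of 1 (as in phi_3).
WEIGHTS = {"NAME": 1.0, "NOMINAL": 0.5, "PRONOUN": 0.25}  # assumed values

def phi_weighted(R, S):
    """R, S: dicts mapping mention id -> mention class."""
    return sum(WEIGHTS[R[m]] for m in R.keys() & S.keys())
```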
No matter what entity similarity measure is used, it
is crucial to have the constraint that the document-level
similarity between reference entities and system entities
is calculated over the best one-to-one map. We will see
examples in Section 3 that misleading results could be
produced without the alignment constraint.
Another observation is that the same evaluation
paradigm can be used in any scenario that needs to mea-
sure the “closeness” between a set of system and refer-
ence objects, provided that a similarity between two ob-
jects is defined. For example, the 2004 ACE tasks include
detecting and recognizing relations in text documents. A
relation instance can be treated as an object and the same
evaluation paradigm can be applied.
3 Comparison with Other Metrics
In this section, we compare the proposed F-measure with
the MUC link-based F-measure (and its variation B-
cube F-measure) and the more recent ACE-value. The
[Figure 1 omitted: five panels, each showing mentions 1-9 and A-C linked into entities.]
Figure 1: Example entities: (1) truth; (2) system response
(a); (3) system response (b); (4) system response (c);
(5) system response (d)
proposed metric fixes the problems associated with the
MUC and B-cube F-measures, and has better interpretabil-
ity than the ACE-value.
3.1 Comparison with the MUC F-measure and
B-cube Metric on Artificial Data
We use the example in Figure 1 to compare the
MUC link-based F-measure, B-cube, and the proposed
mention- and entity-based CEAF. In Figure 1, men-
tions are represented as circles and mentions in an en-
tity are connected by arrows. Intuitively, if each men-
tion is treated equally, system response (a) is bet-
ter than system response (b), since the latter mixes
two big entities, $\{1,2,3,4,5\}$ and $\{8,9,A,B,C\}$, while
the former mixes a small entity, $\{6,7\}$, with one big en-
tity, $\{8,9,A,B,C\}$. System response (b) is clearly better
than system response (c), since the latter puts all the men-
tions into a single entity while (b) has correctly separated
the entity $\{6,7\}$ from the rest. System response (d)
is the worst: the system does not link any mentions and
outputs 12 single-mention entities.
Table 1 summarizes the various F-measures for system re-
sponses (a) to (d): the first column contains the indices
of the system responses found in Figure 1; the second
and third columns are the MUC F-measure and B-cube
F-measure, respectively; the last two columns are the pro-
posed CEAF F-measures, using the entity similarity met-
rics $\phi_3(\cdot,\cdot)$ and $\phi_4(\cdot,\cdot)$, respectively.
As shown in Table 1, the MUC link-based F-measure
fails to distinguish system response (a) from system
response (b), as the two are assigned the same F-measure.
System response (c) represents a trivial output: all
mentions are put in the same entity. Yet the MUC metric
will lead to a 100% recall (9 out of 9 reference links are
correct) and an 81.8% precision (9 out of 11 system links
are correct), which gives rise to a 90% F-measure. It is
striking that a "bad" system response gets such a high
F-measure. Another problem with the MUC link-based
metric is that it is not able to handle single-mention enti-
ties, as there is no link for a single-mention entity. That is
why the entry for system response (d) in Table 1 is empty.

Response   MUC    B-cube   $\phi_3$-CEAF   $\phi_4$-CEAF
(a)        0.947  0.865    0.833           0.733
(b)        0.947  0.737    0.583           0.667
(c)        0.900  0.545    0.417           0.294
(d)        –      0.400    0.250           0.178

Table 1: Comparison of coreference evaluation metrics
B-cube F-measure ranks the four system responses
in Table 1 as desired. This is because B-cube met-
ric (Bagga and Baldwin, 1998) is computed based on
mentions (as opposed to links in the MUC F-measure).
But B-cube uses the same entity “intersecting” pro-
cedure found in computing the MUC F-measure (Vi-
lain et al., 1995), and it sometimes can give counter-
intuitive results. To see this, let us take a look at re-
call and precision for system response (c) and (d) for
B-cube metric. Notice that all the reference entities
are found after intersecting with the system responsce
(c): a19a3a19a167a6a46a29a47a31a33a29a32a137a45a29a32a206a45a29 a2 a40a3a29a27a19a36a138a33a29a35a210a3a40a30a29a35a19a8a204a33a29 a1 a29a36a207a81a29a36a208a159a29a27a209a99a40a3a40 . Therefore,
B-cube recall is a6a8a7a3a7
a4 (the corresponding precision is
a188
a188 a184a14a212
a13
a6a8a7
a212a67a213a188 a184
a131
a31
a212
a184
a188 a184
a15a66a17a124a7a33a50 a137a48a210
a2 ). This is counter-
intuitive because the set of reference entities is not a sub-
set of the proposed entities, thus the system response
should not have gotten a a6a10a7a3a7 a4 recall. The same prob-
lem exists for the system response (d): it gets a a6a10a7a3a7 a4
B-cube precision (the corresponding B-cube recall is
a188
a188 a184
a13 a2
a212
a188
a213
a131
a31
a212
a188
a184
a131
a2
a212
a188
a213
a15a132a17a139a7a33a50 a31
a2
a15 , but clearly not all
the entities in the system response (d) are correct! These
numebrs are summarized in Table 2, where columns with
a21 and a214 represent recall and precision, respectively.
Response   B-cube R   B-cube P   $\phi_3$-R   $\phi_3$-P   $\phi_4$-R   $\phi_4$-P
(c)        1.0        0.375      0.417        0.417        0.196        0.588
(d)        0.25       1.0        0.250        0.250        0.444        0.111

Table 2: Example of counter-intuitive B-cube recall or
precision: system response (c) gets 100% recall (column
R) while system response (d) gets 100% precision (col-
umn P). The problem is fixed in both CEAF metrics.
The counter-intuitive results associated with the MUC
and B-cube F-measures are rooted in the procedure of
“intersecting” the reference and system entities, which al-
lows an entity to be used more than once! We will come
back to this after discussing the CEAF numbers.
From Table 1, we see that both the mention-based (col-
umn under $\phi_3(\cdot,\cdot)$) CEAF and the entity-based ($\phi_4(\cdot,\cdot)$)
CEAF are able to rank the four systems properly: sys-
tems (a) to (d) are increasingly worse. To see how the
CEAF numbers are computed, let us take system re-
sponse (a) as an example: first, the best one-to-one entity
map is determined. In this case, the best map is: the
reference entity $\{1,2,3,4,5\}$ is aligned to the system
entity $\{1,2,3,4,5\}$, the reference entity $\{8,9,A,B,C\}$
is aligned to the system entity $\{6,7,8,9,A,B,C\}$, and the
reference entity $\{6,7\}$ is unaligned. The number
of common mentions is therefore 10, which results
in a mention-based ($\phi_3(\cdot,\cdot)$) recall of $\frac{10}{12}$ and precision
of $\frac{10}{12}$. Since $\phi_4(\{1,2,3,4,5\}, \{1,2,3,4,5\}) = 1$ and
$\phi_4(\{8,9,A,B,C\}, \{6,7,8,9,A,B,C\}) = \frac{10}{12}$, we have $\Phi(g^*) =
1 + \frac{10}{12}$ (cf. equations (4) and (3)), and the entity-based F-
measure (cf. equation (9)) is therefore

$$\frac{2 \times \left(1 + \frac{10}{12}\right)}{3 + 2} = \frac{11}{15} = 0.733.$$

CEAF for the other system responses is computed similarly.
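The arithmetic above can be checked mechanically (a small sketch using the entities of Figure 1, with the letter mentions written as strings):

```python
# entity-based similarity, eq. (9)
phi4 = lambda R, S: 2 * len(R & S) / (len(R) + len(S))

# total similarity of the best alignment for system response (a)
total = (phi4({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5})
         + phi4({8, 9, 'A', 'B', 'C'}, {6, 7, 8, 9, 'A', 'B', 'C'}))
# F = 2 * Phi(g*) / (|R| + |S|), since each phi4(e, e) = 1 and there
# are 3 reference entities and 2 system entities
f_entity = 2 * total / (3 + 2)

# mention-based CEAF: 10 common mentions, 12 mentions on each side
f_mention = 2 * 10 / (12 + 12)
```

Both values reproduce the Table 1 entries for response (a): 0.733 entity-based and 0.833 mention-based.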
The CEAF recall and precision breakdown for systems (c)
and (d) is listed in columns 4 through 7 of Table 2. As can
be seen, neither the mention-based nor the entity-based CEAF
has the abovementioned problem associated with the B-
cube metric, and the recall and precision numbers are
more or less compatible with our intuition: for instance,
for system (c), based on the $\phi_3$-CEAF number, we can say
that about 41.7% of mentions are in the right entity, and
based on the $\phi_4$-CEAF recall and precision, we can state
that about 19.6% of the "true" entities are recovered (recall)
and about 58.8% of the proposed entities are correct (precision).
A comparison of the procedures for computing the
MUC F-measure/B-cube and CEAF reveals that the cru-
cial difference is that the MUC and B-cube F-measures
allow an entity to be used multiple times, while CEAF in-
sists that the entity map be one-to-one, so an entity will never
get double credit. Take system response (c) as an ex-
ample: intersecting the three reference entities in turn with the
system entity produces the same set of reference enti-
ties, which leads to 100% recall. In the intersection step,
the system entity is effectively used three times. In con-
trast, the system entity is aligned to only one reference
entity when computing CEAF.
3.2 Comparisons On Real Data
3.2.1 MUC F-measure and CEAF
We have seen the different behaviors of the MUC F-
measure, B-cube F-measure and CEAF on the artificial
data. We now compare the MUC F-measure, CEAF, and
ACE-value metrics on real data (a comparison between the
MUC and B-cube F-measures can be found in (Bagga
and Baldwin, 1998)). Comparison between the MUC F-
measure and CEAF is done on the MUC6 coreference test
set, while comparison between the CEAF and ACE-value
is done on the 2004 ACE data. The setup reflects the fact
that the official MUC scorer and ACE scorer run on their
own data format and are not easily portable to the other
data set. All the experiments in this section are done on
true mentions.
Penalty #sys-ent MUC-F φ3-CEAF
-0.6 561 .851 0.750
-0.8 538 .854 0.756
-0.9 529 .853 0.753
-1 515 .853 0.753
-1.1 506 .856 0.764
-1.2 483 .857 0.768
-1.4 448 .863 0.761
-1.5 425 .862 0.749
-1.6 411 .864 0.740
-1.7 403 .865 0.741
-10 113 .902 0.445
Table 3: MUC F-measure and mention-based CEAF on
the official MUC6 test set. The first column contains the
penalty value in decreasing order. The second column
contains the number of system-proposed entities. The
column under MUC-F is the MUC F-measure while
φ3-CEAF is the mention-based CEAF.
The coreference system is similar to the one used
in (Luo et al., 2004). Results in Table 3 are produced
by a system trained on the MUC6 training data and tested
on the 30 official MUC6 test documents. The test set
contains 460 reference entities. The coreference system
uses a penalty parameter to balance miss and false alarm
errors: the smaller the parameter, the fewer entities will
be generated. We vary the parameter from −0.6 to −10,
listed in the first column of Table 3, and compare the sys-
tem performance measured by the MUC F-measure and
the proposed mention-based CEAF.
As can be seen, the mention-based CEAF has a clear
maximum when the number of proposed entities is close
to the truth: at the penalty value −1.2, the system produces
483 entities, very close to 460, and the φ3-CEAF
achieves its maximum of 0.768. In contrast, the MUC F-
measure increases almost monotonically as the system
proposes fewer and fewer entities. In fact, the best system
according to the MUC F-measure is the one proposing
only 113 entities. This demonstrates a fundamental flaw
of the MUC F-measure: the metric intrinsically favors
a system producing fewer entities and therefore lacks
discriminativity.
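The bias can be reproduced on the toy example. Below is a sketch of the link-based measure of Vilain et al. (1995), again assuming the example partition of sizes 5, 2 and 5; the degenerate one-entity response still earns an F-measure of 0.9:

```python
# Sketch: the Vilain et al. (1995) link-based MUC measure on the
# assumed toy example (reference entities of sizes 5, 2, 5; the
# response lumps all twelve mentions into one entity), illustrating
# that the degenerate response still scores a high F-measure.

def muc_recall(key, response):
    # recall = sum_i (|S_i| - |p(S_i)|) / sum_i (|S_i| - 1), where
    # p(S_i) partitions key entity S_i by the response entities.
    num = den = 0
    for entity in key:
        parts = {frozenset(entity & r) for r in response if entity & r}
        num += len(entity) - len(parts)
        den += len(entity) - 1
    return num / den

key = [set(range(1, 6)), {6, 7}, set(range(8, 13))]
response = [set(range(1, 13))]

r = muc_recall(key, response)   # 9/9 = 1.0: the 100% recall noted above
p = muc_recall(response, key)   # precision swaps the roles: 9/11
f = 2 * p * r / (p + r)
print(r, round(p, 3), round(f, 3))  # 1.0 0.818 0.9
```

Proposing a single all-in-one entity loses only 2 of 11 links in precision while keeping perfect recall, which is exactly the "fewer entities" bias described in the text.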
3.2.2 ACE-Value and CEAF
Now let us turn to ACE-value. Results in Table 4 are
produced by a system trained on the ACE 2002 and 2004
training data and tested on a separate test set, which con-
tains a204 a2 a137 reference entities. Both ACE-value and the
mention-based CEAF penalizes systems over-producing
or under-producing entities: ACE-value is maximum
Penalty #sys-ent ACE-value(%) φ3-CEAF
0.6 1221 88.5 0.726
0.4 1172 89.1 0.749
0.2 1145 89.4 0.755
0 1105 89.7 0.766
-0.2 1050 89.7 0.775
-0.4 1015 89.7 0.780
-0.6 990 89.5 0.782
-0.8 930 88.6 0.794
-1 891 86.9 0.780
-1.2 865 86.7 0.778
-1.4 834 85.6 0.769
-1.6 790 83.8 0.761
Table 4: Comparison of ACE-value and mention-based
CEAF. The first column contains the penalty value in de-
creasing order. The second column contains the number
of system-proposed entities. ACE-values are in percent-
age. The number of reference entities is 853.
when the penalty value is −0.2 and CEAF is maximum
when the penalty value is −0.8. However, the optimal
CEAF system produces 930 entities while the optimal
ACE-value system produces 1050 entities. Judging from
the number of entities, the optimal CEAF system is closer
to the “truth” than the counterpart of ACE-value. This is
not very surprising since ACE-value is a weighted metric
while CEAF treats each mention and entity equally. As
such, the two metrics have very weak correlation.
While we can make a statement such as “the system
with penalty −0.8 puts about 79.4% of mentions in right
entities”, it is hard to interpret the ACE-value numbers.
Another difference is that CEAF is symmetric1, but
ACE-Value is not. Symmetry is a desirable property. For
example, when comparing inter-annotator agreement, a
symmetric metric is independent of the order of two sets
of input documents, while an asymmetric metric such as
ACE-Value needs to state the input order along with the
metric value.
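A quick numerical check of the symmetry property, using a brute-force mention-based CEAF on two hypothetical partitions of the same five mentions (the entity sets below are illustrative only):

```python
# Sketch: CEAF is symmetric. Both the similarity phi_3 and the best
# one-to-one alignment are symmetric in their arguments, so swapping
# the roles of "reference" and "system" leaves the score unchanged.
# The two partitions below are hypothetical.
from itertools import permutations

def phi3(r, s):
    return len(r & s)  # number of common mentions

def ceaf_score(a, b):
    # Brute-force best one-to-one alignment; both inputs partition the
    # same mention set, so one normalizer serves recall and precision.
    small, large = sorted([a, b], key=len)
    best = max(
        sum(phi3(small[i], large[j]) for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / sum(len(e) for e in a)

A = [{1, 2, 3}, {4, 5}]
B = [{1, 2}, {3, 4, 5}]
score_ab = ceaf_score(A, B)
score_ba = ceaf_score(B, A)
print(score_ab, score_ba)  # 0.8 0.8 -- argument order is irrelevant
```

An asymmetric metric would require reporting which annotation was treated as the key; here either order yields the same value.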
4 Conclusions
A coreference performance metric – CEAF – is proposed
in this paper. The CEAF metric is computed based on the
best one-to-one map between reference entities and sys-
tem entities. Finding the best one-to-one map is a maxi-
mum bipartite matching problem and can be solved by the
Kuhn-Munkres algorithm. Two example entity-pair similarity
measures (i.e., φ3(·,·) and φ4(·,·)) are proposed,
resulting in a mention-based CEAF and an entity-based
CEAF, respectively. It has been shown that the proposed
CEAF metric fixes problems associated with
the MUC link-based F-measure and the B-cube F-measure.
1This was pointed out by Nanda Kambhatla.
The proposed metric also has better interpretability than
ACE-value.
Acknowledgments
This work was partially supported by the Defense Ad-
vanced Research Projects Agency and monitored by
SPAWAR under contract No. N66001-99-2-8916. The
views and findings contained in this material are those
of the authors and do not necessarily reflect the position
or policy of the Government, and no official endorsement
should be inferred.
The author would like to thank three reviewers and my
colleagues, Hongyan Jing and Salim Roukos, for suggestions
on improving the paper.
References
Amit Bagga and Breck Baldwin. 1998. Algorithms
for scoring coreference chains. In Proceedings of the
Linguistic Coreference Workshop at The First Interna-
tional Conference on Language Resources and Evalu-
ation (LREC’98), pages 563–566.
Egon Balas, Donald Miller, Joseph Pekny, and Paolo
Toth. 1991. A parallel shortest augmenting path al-
gorithm for the assignment problem. Journal of the
ACM (JACM), 38(4).
Francois Bourgeois and Jean-Claude Lassalle. 1971. An
extension of the Munkres algorithm for the assignment
problem to rectangular matrices. Communications of
the ACM, 14(12).
R. Fletcher. 1987. Practical Methods of Optimization.
John Wiley and Sons.
Anshul Gupta and Lexing Ying. 1999. Algorithms for
finding maximum matchings in bipartite graphs. Tech-
nical Report RC 21576 (97320), IBM T.J. Watson Re-
search Center, October.
H.W. Kuhn. 1955. The Hungarian method for the assignment
problem. Naval Research Logistics Quarterly,
2(83).
Xiaoqiang Luo, Abe Ittycheriah, Hongyan Jing, Nanda
Kambhatla, and Salim Roukos. 2004. A mention-
synchronous coreference resolution algorithm based
on the bell tree. In Proc. of ACL.
MUC-6. 1995. Proceedings of the Sixth Message Understanding
Conference (MUC-6), San Francisco, CA.
Morgan Kaufmann.
MUC-7. 1998. Proceedings of the Seventh Message Understanding
Conference (MUC-7).
J. Munkres. 1957. Algorithms for the assignment and
transportation problems. Journal of SIAM, 5:32–38.
NIST. 2003a. The ACE evaluation plan.
www.nist.gov/speech/tests/ace/index.htm.
NIST. 2003b. Proceedings of ACE’03 workshop. Book-
let, Alexandria, VA, September.
M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and
L. Hirschman. 1995. A model-theoretic coreference
scoring scheme. In Proc. of MUC6, pages 45–52.
Appendix: Kuhn-Munkres Algorithm
Let i index the reference entities R and j index the system
entities S, and let φ(i, j) be the similarity between the
i-th reference entity and the j-th system entity. Algebraically,
the maximum bipartite matching can be stated as
an integer programming problem:

    max_x  Σ_{i,j} φ(i, j) x_{ij}            (10)

subject to:

    Σ_j x_{ij} ≤ 1, ∀i                       (11)
    Σ_i x_{ij} ≤ 1, ∀j                       (12)
    x_{ij} ∈ {0, 1}, ∀i, j.                  (13)

If x_{ij} = 1, the i-th reference entity and the j-th system
entity are aligned. Constraint (11) (or (12)) implies that a
reference (or system) entity cannot be aligned more than
once with a system (or reference) entity.

Observe that the coefficients of (11) and (12) are unimodular.
Thus, Constraint (13) can be replaced by

    x_{ij} ≥ 0, ∀i, j.                       (14)

The dual (cf. pp. 219 of (Fletcher, 1987)) to the optimization
problem (10) with constraints (11), (12) and (14)
is:
    min_{u,v}  Σ_i u_i + Σ_j v_j             (15)

    s.t.  u_i + v_j ≥ φ(i, j), ∀i, j         (16)
          u_i ≥ 0, ∀i                        (17)
          v_j ≥ 0, ∀j.                       (18)
The dual has the same optimal objective value as the pri-
mal.
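As a numerical illustration on a small hypothetical similarity matrix, one can brute-force the primal (10) and check weak duality against a dual-feasible point; the point used below is the initialization of Algorithm 2, u_i = max_j φ(i, j) and v_j = 0, which satisfies (16)-(18) by construction:

```python
# Sketch: weak duality between the assignment program (10) and its
# dual (15) on a small hypothetical similarity matrix. The dual point
# u_i = max_j phi(i, j), v_j = 0 is feasible, so its objective
# upper-bounds the primal optimum.
from itertools import permutations

phi = [[7, 1, 0],
       [2, 5, 3],
       [0, 4, 6]]
n = len(phi)

# primal optimum of (10): best one-to-one assignment, brute-forced
primal = max(
    sum(phi[i][j] for i, j in enumerate(perm))
    for perm in permutations(range(n))
)

u = [max(row) for row in phi]  # u_i + v_j >= phi(i, j) by construction
v = [0] * n
dual = sum(u) + sum(v)

print(primal, dual)  # 18 18 -- the bound happens to be tight here
assert primal <= dual
```

For this diagonal-dominant matrix the initial dual point is already optimal; in general the algorithm must lower the dual objective toward the primal optimum.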
It can be shown that the optimal conditions for the dual
problem (and hence the maximum similarity match) are:
    u_i + v_j = φ(i, j), if (i, j) is aligned       (19)
    u_i = 0, if i is free (i.e., not aligned)       (20)
    v_j = 0, if j is free.                          (21)
The Kuhn-Munkres algorithm starts with an empty
match and an initial feasible set of {u_i} and {v_j}, and
iteratively increases the cardinality of the match while
satisfying the optimal conditions (19)-(21). Notice that,
conceptually, a matching problem with a rectangular
matrix [φ(i, j)] can always be reduced to a square one by
padding zeros (this is not necessary in practice; see, for
instance, (Bourgeois and Lassalle, 1971)). For this reason,
we state the Kuhn-Munkres algorithm for the case
where |R| = |S| (or m = n) in Algorithm 2. The proof
of correctness is omitted due to the space limit.

Note that P_aug(i, j) on line 10 stands for the augmenting
(i.e., a free node followed by an aligned node, followed
by a free node, ...) path from i to j in the corresponding
bipartite graph. M ⊕ P_aug(i, j) is understood as
edge “exclusive-or”: if an edge (k, l) is in M and on the
path P_aug(i, j), it will be removed from M; if the edge is
on the path but not in M, it will be added.
Algorithm 2 Kuhn-Munkres Algorithm
Input: similarity matrix: [φ(i, j)]
Output: best match M = {(i, j)} and similarity Φ.
1: Initialize: ∀i, u_i = max_j φ(i, j); ∀j, v_j = 0; M = ∅.
2: For i = 1 to m
3:   If i is not free, Continue; EndIf.
4:   S = {i}, T = ∅;
5:   While true
6:     N(S) = {l : ∃k ∈ S, s.t. φ(k, l) = u_k + v_l}
7:     If T ≠ N(S)
8:       pick j ∈ N(S) \ T
9:       If j is free
10:        M = M ⊕ P_aug(i, j); break
11:      Else
12:        Find i′ such that (i′, j) ∈ M.
13:        S = S ∪ {i′}, T = T ∪ {j}.
14:        Goto line 6.
15:      EndIf
16:    Else (i.e., T = N(S))
17:      Δ = min_{k∈S, l∉T} {u_k + v_l − φ(k, l)}
18:      (î, ĵ) = arg min_{k∈S, l∉T} {u_k + v_l − φ(k, l)}
19:      u_k = u_k − Δ for k ∈ S.
20:      v_l = v_l + Δ for l ∈ T.
21:      j = ĵ. Goto line 9.
22:    EndIf
23:  EndWhile
24: EndFor
25: Φ = Σ_{(k,l)∈M} φ(k, l).
26: Return M and Φ.
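For reference, the Python sketch below implements the same Hungarian method in its common O(n³) "potentials" form. It is not a line-by-line transcription of Algorithm 2 (the inner bookkeeping differs), and it maximizes similarity by negating φ and solving the equivalent minimization:

```python
# Sketch: maximum-similarity assignment via the Hungarian method,
# the standard potentials variant (not a transcription of Algorithm 2).
def kuhn_munkres_max(phi):
    """Return (match, total) for a square similarity matrix phi,
    where match[i] = j gives the best one-to-one alignment."""
    n = len(phi)
    INF = float("inf")
    cost = [[-phi[i][j] for j in range(n)] for i in range(n)]
    u = [0.0] * (n + 1)   # row potentials (cf. u_i in the dual)
    v = [0.0] * (n + 1)   # column potentials (cf. v_j)
    p = [0] * (n + 1)     # p[j]: row matched to column j (1-based)
    way = [0] * (n + 1)   # back-pointers along the alternating tree
    for i in range(1, n + 1):
        p[0] = i          # column 0 is a virtual starting column
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):  # dual update (cf. lines 19-20)
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:        # reached a free column: augment
                break
        while j0:                 # flip edges along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = {p[j] - 1: j - 1 for j in range(1, n + 1)}
    total = sum(phi[i][j] for i, j in match.items())
    return match, total

phi = [[7, 5, 11], [5, 4, 1], [9, 3, 2]]  # hypothetical similarities
match, total = kuhn_munkres_max(phi)
print(total)  # 24
```

On this matrix the best alignment maps rows 0, 1, 2 to columns 2, 1, 0 (11 + 4 + 9 = 24), which a brute-force check over all permutations confirms.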
