<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1004"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 25-32, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics On Coreference Resolution Performance Metrics</Title> <Section position="3" start_page="0" end_page="25" type="metho"> <SectionTitle> (1): &quot;The American Medical Association </SectionTitle> <Paragraph position="0"> voted yesterday to install the heir apparent as its president-elect, rejecting a strong, upstart challenge by a district doctor who argued that the nation's largest physicians' group needs stronger ethics and new leadership.&quot; Mentions are underlined: &quot;American Medical Association&quot;, &quot;its&quot; and &quot;group&quot; refer to the same organization (object), and together they form an entity. Similarly, &quot;the heir apparent&quot; and &quot;president-elect&quot; refer to the same person and form another entity. It is worth pointing out that the entity definition here is different from the one used in the Message Understanding Conference (MUC) task (MUC, 1995; MUC, 1998): an ACE entity is called a coreference chain or equivalence class in MUC, and an ACE mention is called an entity in MUC.</Paragraph> <Paragraph position="1"> An important problem in coreference resolution is how to evaluate a system's performance. A good performance metric should have the following two properties: * Discriminativity: the ability to differentiate a good system from a bad one. While this criterion sounds trivial, not all performance metrics used in the past possess this property.</Paragraph> <Paragraph position="2"> * Interpretability: A good metric should be easy to interpret.
That is, there should be an intuitive sense of how good a system is when a metric suggests that a certain percentage of coreference results are correct.</Paragraph> <Paragraph position="3"> For example, when a metric reports 95% or above correct for a system, we would expect that the vast majority of mentions are in the right entities or coreference chains.</Paragraph> <Paragraph position="4"> A widely-used metric is the link-based F-measure (Vilain et al., 1995) adopted in the MUC task. It is computed by first counting the number of common links between the reference (or &quot;truth&quot;) and the system output (or &quot;response&quot;); the link precision is the number of common links divided by the number of links in the system output, and the link recall is the number of common links divided by the number of links in the reference. There are known problems associated with the link-based F-measure. First, it ignores single-mention entities, since no link can be found in these entities. Second, and more importantly, it fails to distinguish system outputs of different quality: the link-based F-measure intrinsically favors systems producing fewer entities, and may result in higher F-measures for worse systems. We will revisit these issues in Section 3.</Paragraph> <Paragraph position="5"> To counter these shortcomings, Bagga and Baldwin (1998) proposed the B-cubed metric, which first computes a precision and recall for each individual mention, and then takes the weighted sum of these individual precisions and recalls as the final metric. While the B-cubed metric fixes some of the shortcomings of the MUC F-measure, it has its own problems: for example, the mention precision/recall is computed by comparing entities containing the mention, and therefore an entity can be used more than once. The implication of this drawback will be revisited in Section 3.</Paragraph> <Paragraph position="6"> In the ACE task, a value-based metric called ACE-value (NIST, 2003b) is used.
The ACE-value is computed by counting the number of false-alarm entities, the number of missed entities, and the number of mistaken entities. Each error is associated with a cost factor that depends on things such as entity type (e.g., &quot;LOCATION&quot;, &quot;PERSON&quot;) and mention level (e.g., &quot;NAME,&quot; &quot;NOMINAL,&quot; and &quot;PRONOUN&quot;). The total cost is the sum of the three costs, which is then normalized against the cost of a nominal system that does not output any entity. The ACE-value is finally computed by subtracting the normalized cost from 1. A perfect coreference system will get a 100% ACE-value, while a system that outputs no entities will get a 0% ACE-value. A system outputting many erroneous entities could even get a negative ACE-value. The ACE-value is computed by aligning entities and thus avoids the problems of the MUC F-measure. The ACE-value is, however, hard to interpret: a 90% ACE-value does not mean that 90% of the system entities or mentions are correct, but rather that the cost of the system, relative to the one outputting no entities, is 10%.</Paragraph> <Paragraph position="7"> In this paper, we aim to develop an evaluation metric that is able to measure the quality of a coreference system - that is, an intuitively better system gets a higher score than a worse system - and that is easy to interpret. To this end, we observe that the task of a coreference system is to recognize entities, and we propose a metric called the Constrained Entity-Aligned F-Measure (CEAF). At the core of the metric is the optimal one-to-one map between subsets of reference and system entities: system entities and reference entities are aligned by maximizing the total entity similarity under the constraint that a reference entity is aligned with at most one system entity, and vice versa. Once the total similarity is defined, it is straightforward to compute recall, precision and F-measure.
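To make the role of the one-to-one constraint concrete, consider a tiny sketch (illustrative data, not from the paper): with reference entities {1, 2} and {3, 4} and a single system entity {1, 2, 3, 4}, an unconstrained matching would credit the system entity against both references, while a one-to-one alignment credits it only once.

```python
refs = [{1, 2}, {3, 4}]
sys_entities = [{1, 2, 3, 4}]

# Unconstrained: each reference entity may reuse the same system entity,
# so the single system entity is credited twice (2 + 2 = 4 common mentions).
unconstrained = sum(max(len(R & S) for S in sys_entities) for R in refs)

# One-to-one: the single system entity can be aligned to at most one
# reference entity, so the best total is a single pair's overlap (2).
# (The max over single pairs suffices here because there is only one
# system entity to align.)
constrained = max(len(R & S) for R in refs for S in sys_entities)
```

The inflated unconstrained credit is exactly the kind of double counting that the alignment constraint rules out.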
The constraint imposed on the entity alignment makes it impossible to &quot;cheat&quot; the metric: a system outputting too many entities will be penalized in precision, while a system outputting too few entities will be penalized in recall. The metric also has the property that a perfect system gets an F-measure of 1, while a system outputting no entities or no common mentions gets an F-measure of 0. The proposed CEAF has a clear meaning: for mention-based CEAF, it reflects the percentage of mentions that are in the correct entities; for entity-based CEAF, it reflects the percentage of correctly recognized entities.</Paragraph> <Paragraph position="8"> The rest of the paper is organized as follows. In Section 2, the Constrained Entity-Aligned F-Measure is presented in detail: the constrained entity alignment can be represented by a bipartite graph, and the optimal alignment can be found by the Kuhn-Munkres algorithm (Kuhn, 1955; Munkres, 1957). We also present two entity-pair similarity measures that can be used in CEAF: one is the absolute number of common mentions between two entities, and the other is a &quot;local&quot; mention F-measure between two entities. The two measures lead to the mention-based and entity-based CEAF, respectively.</Paragraph> <Paragraph position="9"> In Section 3, we compare the proposed metric with the MUC link-based metric and the ACE-value on both artificial and real data, and point out the problems of the MUC F-measure.</Paragraph> </Section> <Section position="4" start_page="25" end_page="27" type="metho"> <SectionTitle> 2 Constrained Entity-Alignment F-Measure </SectionTitle> <Paragraph position="0"> Some notations are needed before we present the proposed metric and the algorithm to compute the metric.
Let reference entities in a document d be R(d) = {R_i : i = 1, 2, ..., |R(d)|} and system entities be S(d) = {S_j : j = 1, 2, ..., |S(d)|}, where |·| denotes the size of a set.</Paragraph> <Paragraph position="2"> To simplify typesetting, we will omit the dependency on d when it is clear from context, and write R(d) as R and S(d) as S. Let m = min(|R|, |S|),</Paragraph> <Paragraph position="4"> and let R_m ⊆ R and S_m ⊆ S be any subsets with m entities. That is, |R_m| = m and |S_m| = m. Let G(R_m, S_m) be the set of one-to-one entity maps from R_m to S_m, and let G_m be the set of all possible one-to-one maps between the size-m subsets of R and S. That is, G_m = ∪_{R_m, S_m} G(R_m, S_m).</Paragraph> <Paragraph position="6"> The requirement of a one-to-one map means that, for any g ∈ G(R_m, S_m), distinct entities in R_m are mapped to distinct entities in S_m. Let φ(R, S) ≥ 0 be a similarity measure between an entity pair (R, S); a zero value means that R and S have nothing in common. For example, φ(R, S) could be the number of common mentions shared by R and S, and φ(R, R) the number of mentions in entity R.</Paragraph> <Paragraph position="7"> For any g ∈ G_m, the total similarity Φ(g) for a map g is the sum of similarities between the aligned entity pairs: Φ(g) = Σ_{R ∈ R_m} φ(R, g(R)). Given a document d with</Paragraph> <Paragraph position="9"> its reference entities R and system entities S, we can find the best alignment maximizing the total similarity: g* = argmax_{g ∈ G_m} Φ(g). (1)</Paragraph> <Paragraph position="11"> Let R*_m and S*_m = g*(R*_m) denote the reference and system entity subsets where g* is attained, respectively.</Paragraph> <Paragraph position="12"> Then the maximum total similarity is Φ(g*) = Σ_{R ∈ R*_m} φ(R, g*(R)). (2)</Paragraph> <Paragraph position="14"> If we insist that φ(R, S) = 0 whenever R or S is empty, then the non-negativity requirement of φ(R, S) makes it unnecessary to consider the possibility of mapping one entity to an empty entity, since the one-to-one map maximizing Φ(g) must be in G_m.</Paragraph> <Paragraph position="15"> Since we can compute the entity
self-similarity Φ(R) = Σ_i φ(R_i, R_i) and Φ(S) = Σ_j φ(S_j, S_j)</Paragraph> <Paragraph position="17"> (using the identity map), we are now ready to define the precision, recall and F-measure as follows: p = Φ(g*) / Φ(S), (3) r = Φ(g*) / Φ(R), (4) F = 2pr / (p + r). (5) Note that Φ(g*) is summed over at most m aligned pairs of reference and system entities, and entities not aligned do not get credit. Thus the F-measure (5) penalizes a coreference system that proposes too many (i.e., lower precision) or too few entities (i.e., lower recall), which is a desired property.</Paragraph> <Paragraph position="18"> In the above discussion, it is assumed that the similarity measure φ(R, S) is computed for all entity pairs (R, S) ∈ R × S. In practice, only entity pairs with non-zero similarity need</Paragraph> <Paragraph position="20"> be considered when searching for the optimal alignment.</Paragraph> <Paragraph position="21"> Consequently the optimal alignment could involve fewer than m reference and system entities. This can speed up the F-measure computation considerably when the majority of entity pairs have zero similarity. Nevertheless, summing over m entity pairs in the general formula (2) does not change the optimal total similarity between R and S, and hence does not change the F-measure.</Paragraph> <Paragraph position="22"> In formulae (3)-(5), there is only one document in the test corpus. The extension to a corpus with multiple test documents is trivial: just accumulate statistics on a per-document basis for both the denominators and numerators in (3) and (4), and take the ratio of the two.</Paragraph> <Paragraph position="23"> So far, we have tacitly kept abstract the similarity measure φ(R, S) for an entity pair R and S. We will defer the discussion of this metric to Section 2.2.
Instead, we first present the algorithm computing the F-measure (3)-(5).</Paragraph> <Section position="1" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 2.1 Computing Optimal Alignment and F-measure </SectionTitle> <Paragraph position="0"> A naive implementation of (1) would enumerate all the possible one-to-one maps (or alignments) between size-m subsets of R and size-m subsets of S, and pick the alignment maximizing the total similarity. This is not practical, since there are</Paragraph> <Paragraph position="2"> C(n, m) · m! possible one-to-one maps, where n = max(|R|, |S|). The enumeration blows up even for a document with a moderate number of entities: it amounts to about 3.6 million maps for n = m = 10, i.e., a document with only 10 reference and 10 system entities.</Paragraph> <Paragraph position="3"> Fortunately, the entity alignment problem under the constraint that an entity can be aligned at most once is the classical maximum bipartite matching problem, and there exists an algorithm (Kuhn, 1955; Munkres, 1957) (henceforth the Kuhn-Munkres algorithm) that can find the optimal solution in polynomial time. Casting the entity alignment problem as maximum bipartite matching is straightforward: each entity in R and S is a vertex, and each entity pair (R, S), with R ∈ R and S ∈ S, is connected by an edge with weight φ(R, S). The problem (1) is then exactly the maximum bipartite matching problem.</Paragraph> <Paragraph position="4"> With the Kuhn-Munkres algorithm, the procedure to compute the F-measure (5) can be described as Algorithm 1.</Paragraph> <Paragraph position="5"> Algorithm 1 Computing the F-measure (5).</Paragraph> <Paragraph position="7"> The input to the algorithm is the reference entities R and the system entities S.
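As a toy illustration (not the authors' implementation), the procedure can be sketched as follows; the brute-force permutation search stands in for the Kuhn-Munkres step, which `scipy.optimize.linear_sum_assignment` could replace for realistic document sizes:

```python
from itertools import permutations

def phi(R, S):
    # Mention-based entity similarity of Eq. (8): number of common mentions.
    return len(R & S)

def best_total_similarity(refs, syss):
    # Brute-force stand-in for the Kuhn-Munkres step: maximize the total
    # similarity over all one-to-one maps between size-m subsets,
    # where m = min(len(refs), len(syss)).
    small, big = (refs, syss) if len(refs) <= len(syss) else (syss, refs)
    return max(sum(phi(a, b) for a, b in zip(small, perm))
               for perm in permutations(big, len(small)))

def ceaf(refs, syss):
    # Precision (3), recall (4) and F-measure (5): the optimal total
    # similarity normalized by system / reference self-similarity.
    total = best_total_similarity(refs, syss)
    p = total / sum(phi(S, S) for S in syss)
    r = total / sum(phi(R, R) for R in refs)
    return p, r, 2 * p * r / (p + r)

# Toy document: two reference entities, two system entities.
refs = [{"m1", "m2", "m3"}, {"m4", "m5"}]
syss = [{"m1", "m2"}, {"m3", "m4", "m5"}]
p, r, f = ceaf(refs, syss)
```

Here the best map pairs the entities in order, for 4 common mentions, giving p = r = 4/5.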
The algorithm returns the best one-to-one map g* and the F-measure in equation (5). The loop from line 2 to line 4 computes the similarity between all the possible reference and system entity pairs; the complexity of this loop is O(mn). Line 5 calls the Kuhn-Munkres algorithm, which takes as input the entity-pair scores</Paragraph> <Paragraph position="9"> {φ(R, S)} and outputs the best map g* and the corresponding total similarity Φ(g*). The worst-case complexity of the Kuhn-Munkres algorithm (i.e., when all entries in {φ(R, S)} are non-zero) is polynomial in m and n. Line 6 computes the &quot;self-similarities&quot; Φ(R) and Φ(S) needed in the F-measure computation at line 7.</Paragraph> <Paragraph position="10"> The core of the F-measure computation is the Kuhn-Munkres algorithm at line 5. The algorithm was originally discovered by Kuhn (1955) and Munkres (1957) to solve the matching (a.k.a. assignment) problem for square matrices. Since then, it has been extended to rectangular matrices (Bourgeois and Lassalle, 1971) and parallelized (Balas et al., 1991). A recent review can be found in (Gupta and Ying, 1999), which also details the techniques of fast implementation. A short description of the algorithm is included in the Appendix for the sake of completeness.</Paragraph> </Section> <Section position="2" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 2.2 Entity Similarity Metric </SectionTitle> <Paragraph position="0"> In this section we consider the entity similarity metric</Paragraph> <Paragraph position="2"> φ(R, S). Intuitively, we require that φ(R, S) be large when R and S are &quot;close&quot; and small when R and S are very different.
Some straightforward choices could be φ1(R, S) = 1 if R = S and 0 otherwise, (6) or φ2(R, S) = 1 if R ∩ S ≠ ∅ and 0 otherwise. (7)</Paragraph> <Paragraph position="4"> (6) insists that two entities are the same if all their mentions are the same, while (7) goes to the other extreme: two entities are the same if they share at least one common mention.</Paragraph> <Paragraph position="5"> (6) does not offer a good granularity of similarity: for example, if R = {a, b, c}, one system response is S1 = {a, b}, and the other system response is S2 = {a}, then clearly S1 is more similar to R than S2, yet φ1(R, S1) = φ1(R, S2) = 0; (7)</Paragraph> <Paragraph position="7"> lacks the desired discriminativity as well, since φ2(R, S1) = φ2(R, S2) = 1.</Paragraph> <Paragraph position="8"> From the above argument, it is clear that we want a metric that can measure the degree to which two entities are similar, not a binary decision. One natural choice is measuring how many common mentions two entities share, either as an absolute number or as a relative number: φ3(R, S) = |R ∩ S|, (8) or φ4(R, S) = 2|R ∩ S| / (|R| + |S|). (9)</Paragraph> <Paragraph position="10"> (8) is the absolute number of common mentions shared by R and S, while (9) is the mention F-measure between R and S, a relative number measuring how similar R and S are. For the abovementioned example, φ3(R, S1) = 2 and φ3(R, S2) = 1, while φ4(R, S1) = 0.8 and φ4(R, S2) = 0.5.</Paragraph> <Paragraph position="12"> If φ3(·,·) is adopted in Algorithm 1, Φ(g*) is the total number of common mentions under the best one-to-one map g*, while the denominators of (3) and (4) are the number of system mentions and the number of reference mentions, respectively. The F-measure in (5) can then be interpreted as the ratio of mentions that are in the &quot;right&quot; entities. Similarly, if φ4(·,·) is adopted in Algorithm 1, the denominators of (3) and (4) are the number of system entities and the number of reference entities, respectively, and the F-measure in (5) can be understood as the ratio of correct entities.
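On the example just given, the two measures (8) and (9) can be checked directly (a small sketch with entities as Python sets; the names are illustrative):

```python
def phi3(R, S):
    # Eq. (8): absolute number of common mentions.
    return len(R & S)

def phi4(R, S):
    # Eq. (9): mention F-measure between the two entities.
    return 2.0 * len(R & S) / (len(R) + len(S))

R  = {"a", "b", "c"}   # reference entity
S1 = {"a", "b"}        # response sharing two mentions with R
S2 = {"a"}             # response sharing one mention with R

# Unlike the binary measures (6) and (7), both (8) and (9) judge
# S1 to be closer to R than S2:
sims3 = (phi3(R, S1), phi3(R, S2))   # (2, 1)
sims4 = (phi4(R, S1), phi4(R, S2))   # (0.8, 0.5)
```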
Therefore, (5) is called mention-based CEAF and entity-based CEAF when (8) and (9) are used, respectively.</Paragraph> <Paragraph position="14"> φ3(·,·) and φ4(·,·) are two reasonable entity similarity measures, but by no means the only choices. At the mention level, partial credit could be assigned to two mentions with different but overlapping spans; or, when mention type is available, weights defined on the type confusion matrix can be incorporated. At the entity level, entity attributes, if available, can be weighted in the similarity measure as well. For example, ACE data defines three entity classes: NAME, NOMINAL and PRONOUN. Different weights can be assigned to the three classes.</Paragraph> <Paragraph position="15"> No matter what entity similarity measure is used, it is crucial to have the constraint that the document-level similarity between reference entities and system entities is calculated over the best one-to-one map. We will see examples in Section 3 where misleading results can be produced without the alignment constraint.</Paragraph> <Paragraph position="16"> Another observation is that the same evaluation paradigm can be used in any scenario that needs to measure the &quot;closeness&quot; between a set of system objects and a set of reference objects, provided that a similarity between two objects is defined. For example, the 2004 ACE tasks include detecting and recognizing relations in text documents. A relation instance can be treated as an object and the same evaluation paradigm can be applied.</Paragraph> </Section> </Section> <Section position="5" start_page="27" end_page="30" type="metho"> <SectionTitle> 3 Comparison with Other Metrics </SectionTitle> <Paragraph position="0"> In this section, we compare the proposed F-measure with the MUC link-based F-measure (and its variant, the B-cube F-measure) and the more recent ACE-value.
The proposed metric fixes the problems associated with the MUC and B-cube F-measures, and has better interpretability than the ACE-value.</Paragraph> <Section position="1" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 3.1 Comparison with the MUC F-measure and B-cube Metric on Artificial Data </SectionTitle> <Paragraph position="0"> We use the example in Figure 1 to compare the MUC link-based F-measure, B-cube, and the proposed mention- and entity-based CEAF. In Figure 1, mentions are represented as circles and mentions in an entity are connected by arrows. Intuitively, if each mention is treated equally, the system response (a) is better than the system response (b), since the latter mixes two big entities, {1, 2, 3, 4, 5} and {8, 9, 10, 11, 12}, while the former mixes a small entity, {6, 7}, with one big entity, {8, 9, 10, 11, 12}. System response (b) is clearly better than system response (c), since the latter puts all the mentions into a single entity while (b) has correctly separated the entity {6, 7} from the rest. The system response (d) is the worst: the system does not link any mentions and outputs 12 single-mention entities.</Paragraph> <Paragraph position="1"> Table 1 summarizes the various F-measures for system responses (a) to (d): the first column contains the indices of the system responses found in Figure 1; the second and third columns are the MUC F-measure and the B-cube F-measure, respectively; the last two columns are the proposed CEAF F-measures, using the entity similarity metrics φ3(·,·) and φ4(·,·), respectively.
As shown in Table 1, the MUC link-based F-measure fails to distinguish the system response (a) from the system response (b), as the two are assigned the same F-measure.</Paragraph> <Paragraph position="2"> The system response (c) represents a trivial output: all mentions are put in the same entity. Yet the MUC metric gives it a 100% recall (all reference links are correct) and an 81.8% precision (9 out of 11 system links are correct), which gives rise to a 90% F-measure. It is striking that a &quot;bad&quot; system response gets such a high F-measure. Another problem with the MUC link-based metric is that it is not able to handle single-mention entities, as there is no link for a single-mention entity. That is why the entry for system response (d) in Table 1 is empty.</Paragraph> <Paragraph position="3"> The B-cube F-measure ranks the four system responses in Table 1 as desired. This is because the B-cube metric (Bagga and Baldwin, 1998) is computed based on mentions (as opposed to links in the MUC F-measure).</Paragraph> <Paragraph position="4"> But B-cube uses the same entity &quot;intersecting&quot; procedure found in computing the MUC F-measure (Vilain et al., 1995), and it can sometimes give counter-intuitive results. To see this, let us take a look at the B-cube recall and precision for system responses (c) and (d). Notice that all the reference entities are found intact after intersecting with the system response (c), so (c) gets a 100% B-cube recall.</Paragraph> <Paragraph position="6"> This is counter-intuitive because the set of reference entities is not a subset of the proposed entities, and thus the system response should not have gotten a 100% recall. The same problem exists for the system response (d): it gets a 100% B-cube precision (the corresponding B-cube recall is only 25%), but clearly not all the entities in the system response (d) are correct!
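Both sets of counter-intuitive numbers can be reproduced from the standard definitions (a sketch assuming complete mention coverage, as in Figure 1):

```python
def muc_ratio(key, response):
    # Vilain et al. (1995), partition form: for each key entity T, count
    # |T| minus the number of pieces T is split into by the response,
    # then divide by the sum of (|T| - 1). Recall is muc_ratio(truth, resp);
    # precision is the same call with the arguments swapped.
    num = sum(len(T) - len({frozenset(T & R) for R in response if T & R})
              for T in key)
    den = sum(len(T) - 1 for T in key)
    return num / den

def b3_recall(key, response):
    # Bagga and Baldwin (1998): average, over mentions, of |K & R| / |K|,
    # where K and R are the key/response entities containing the mention.
    # Precision is again the same call with the arguments swapped.
    mentions = [m for E in key for m in E]
    total = 0.0
    for m in mentions:
        K = next(E for E in key if m in E)
        R = next(E for E in response if m in E)
        total += len(K & R) / len(K)
    return total / len(mentions)

truth  = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10, 11, 12}]
resp_c = [set(range(1, 13))]             # (c): everything in one entity
resp_d = [{m} for m in range(1, 13)]     # (d): twelve singletons

muc_r = muc_ratio(truth, resp_c)         # 9/9  = 1.0
muc_p = muc_ratio(resp_c, truth)         # 9/11 ~ 0.818
muc_f = 2 * muc_p * muc_r / (muc_p + muc_r)   # ~0.90
b3_r_c = b3_recall(truth, resp_c)        # 1.0, although (c) is wrong
b3_p_d = b3_recall(resp_d, truth)        # 1.0, although (d) is wrong
```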
These numbers are summarized in Table 2, where columns R and P contain recall and precision: system response (c) gets a 100% B-cube recall (column R), while system response (d) gets a 100% B-cube precision (column P). The problem is fixed in both CEAF metrics. The counter-intuitive results associated with the MUC and B-cube F-measures are rooted in the procedure of &quot;intersecting&quot; the reference and system entities, which allows an entity to be used more than once! We will come back to this after discussing the CEAF numbers.</Paragraph> <Paragraph position="7"> From Table 1, we see that both the mention-based (column under φ3(·,·)) CEAF and the entity-based (φ4(·,·)) CEAF are able to rank the four systems properly: systems (a) to (d) are increasingly worse. To see how the CEAF numbers are computed, let us take the system response (a) as an example. First, the best one-to-one entity map is determined. In this case, the best map is: the reference entity {1, 2, 3, 4, 5} is aligned to the system entity {1, 2, 3, 4, 5}, the reference entity {8, 9, 10, 11, 12} is aligned to the system entity {6, 7, 8, 9, 10, 11, 12}, and the reference entity {6, 7} is left unaligned. The number of common mentions is therefore 10, which results in a mention-based (φ3(·,·)) CEAF recall of 10/12 and, since the response also contains 12 mentions, a precision of 10/12. The CEAF recall and precision breakdown for systems (c) and (d) is listed in columns 4 through 7 of Table 2.
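The mention-based CEAF for the four responses of Figure 1 can be recomputed by brute force (a sketch; `scipy.optimize.linear_sum_assignment` would do the alignment for real data):

```python
from itertools import permutations

def ceaf_mention_f(key, response):
    # phi3(R, S) = |R & S|; align entities one-to-one to maximize the total.
    small, big = sorted([key, response], key=len)
    best = max(sum(len(a & b) for a, b in zip(small, perm))
               for perm in permutations(big, len(small)))
    p = best / sum(len(E) for E in response)   # Eq. (3)
    r = best / sum(len(E) for E in key)        # Eq. (4)
    return 2 * p * r / (p + r)                 # Eq. (5)

truth = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10, 11, 12}]
responses = {
    "a": [{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10, 11, 12}],
    "b": [{1, 2, 3, 4, 5, 8, 9, 10, 11, 12}, {6, 7}],
    "c": [set(range(1, 13))],
    "d": [{m} for m in range(1, 13)],
}
scores = {name: ceaf_mention_f(truth, resp)
          for name, resp in responses.items()}
# scores rank (a) > (b) > (c) > (d), matching intuition,
# unlike the MUC link-based F-measure
```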
As can be seen, neither mention-based nor entity-based CEAF has the abovementioned problem associated with the B-cube metric, and the recall and precision numbers are more or less compatible with our intuition: for instance, for system (c), based on the φ3-CEAF number, we can say that about 41.7% of the mentions are in the right entity, and based on the φ4-CEAF recall and precision, we can state that about 19.6% of the &quot;true&quot; entities are recovered (recall) and about 58.8% of the proposed entities are correct (precision). A comparison of the procedures for computing the MUC F-measure/B-cube and CEAF reveals that the crucial difference is that the MUC and B-cube F-measures allow an entity to be used multiple times, while CEAF insists that the entity map be one-to-one, so an entity never gets double credit. Take the system response (c) as an example: intersecting the three reference entities in turn with the single system entity produces the same set of reference entities, which leads to a 100% B-cube recall. In the intersection step, the system entity is effectively used three times. In contrast, the system entity is aligned to only one reference entity when computing CEAF.</Paragraph> </Section> <Section position="2" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 3.2 Comparisons On Real Data 3.2.1 MUC F-measure and CEAF </SectionTitle> <Paragraph position="0"> We have seen the different behaviors of the MUC F-measure, the B-cube F-measure and CEAF on artificial data. We now compare the MUC F-measure, CEAF, and ACE-value metrics on real data (a comparison between the MUC and B-cube F-measures can be found in (Bagga and Baldwin, 1998)). The comparison between the MUC F-measure and CEAF is done on the MUC6 coreference test set, while the comparison between CEAF and ACE-value is done on the 2004 ACE data.
The setup reflects the fact that the official MUC scorer and the ACE scorer each run on their own data format and are not easily portable to the other data set. All the experiments in this section are done on true mentions.</Paragraph> <Paragraph position="1"> Table 3 compares the MUC F-measure and the mention-based CEAF on the official MUC6 test set. The first column contains the penalty value in decreasing order. The second column contains the number of system-proposed entities. The column under MUC-F is the MUC F-measure, while φ3-CEAF is the mention-based CEAF.</Paragraph> <Paragraph position="2"> The coreference system is similar to the one used in (Luo et al., 2004). Results in Table 3 are produced by a system trained on the MUC6 training data and tested on the 30 official MUC6 test documents. The test set contains 460 reference entities. The coreference system uses a penalty parameter to balance miss and false alarm errors: the smaller the parameter, the fewer entities will be generated. We vary the parameter from -0.6 to -10, listed in the first column of Table 3, and compare the system performance measured by the MUC F-measure and the proposed mention-based CEAF.</Paragraph> <Paragraph position="3"> As can be seen, the mention-based CEAF has a clear maximum when the number of proposed entities is close to the truth: at the penalty value -1.2, the system produces 483 entities, very close to 460, and the φ3-CEAF achieves its maximum of 0.768. In contrast, the MUC F-measure increases almost monotonically as the system proposes fewer and fewer entities. In fact, the best system according to the MUC F-measure is the one proposing only 113 entities. This demonstrates a fundamental flaw of the MUC F-measure: the metric intrinsically favors a system producing fewer entities and therefore lacks discriminativity.</Paragraph> <Paragraph position="4"> Now let us turn to ACE-value.
Results in Table 4 are produced by a system trained on the ACE 2002 and 2004 training data and tested on a separate test set, which contains 853 reference entities. In Table 4, the first column contains the penalty value in decreasing order and the second column the number of system-proposed entities; ACE-values are in percentage. Both ACE-value and the mention-based CEAF penalize systems that over-produce or under-produce entities: ACE-value is maximum when the penalty value is -0.2, and CEAF is maximum when the penalty value is -0.8. However, the optimal CEAF system produces 930 entities, while the optimal ACE-value system produces 1050 entities. Judging from the number of entities, the optimal CEAF system is closer to the &quot;truth&quot; than its ACE-value counterpart. This is not very surprising, since ACE-value is a weighted metric while CEAF treats each mention and entity equally. As such, the two metrics have very weak correlation.</Paragraph> <Paragraph position="5"> While we can make a statement such as &quot;the system with penalty -0.8 puts about 79.4% of mentions in the right entities&quot;, it is hard to interpret the ACE-value numbers. Another difference is that CEAF is symmetric, but ACE-value is not. Symmetry is a desirable property. For example, when comparing inter-annotator agreement, a symmetric metric is independent of the order of the two sets of input documents, while an asymmetric metric such as ACE-value needs to state the input order along with the metric value.</Paragraph> </Section> </Section> </Paper>