DESCRIPTION OF THE UPENN CAMP SYSTEM AS USED
FOR COREFERENCE
Breck Baldwin, Tom Morton, Amit Bagga, Jason Baldridge, Raman Chandrasekar,
Alexis Dimitriadis, Kieran Snyder, Magdalena Wolska,
Institute for Research in Cognitive Science
3401 Walnut St. 400C
Philadelphia, PA 19104. USA
Phone: (215) 898-0329
Fax: (215) 573-9247
Email: {breck,tsmorton}@linc.cis.upenn.edu
Introduction
In this paper we present some advances made to the CAMP system since its inception for MUC-6. Although the infrastructure has been completely re-implemented, the architecture has remained fundamentally the same; consequently we will focus on some advances we have made in our understanding of coreference and then discuss the performance of the system.
Scoring Coreference Output
Scoring is an extremely important aspect of coreference algorithm development. The score for a particular run is the single strongest measure of how well the system is performing, and it can strongly determine directions for further improvements. In this paper, we present several different scoring algorithms and detail their respective strengths and weaknesses for varying classes of processing. In particular, we describe and analyze the coreference scoring algorithm used to evaluate the coreference systems in the sixth Message Understanding Conference (MUC-6) [MUC-6, 95]. We also present two shortcomings of this algorithm. In addition, we present a new coreference scoring algorithm, our B-CUBED algorithm, which was designed to overcome the shortcomings of the MUC-6 algorithm.
Scoring in MUC-6/7: Vilain et al.
Prior to Vilain et al.'s coreference scoring algorithm [Vilain, 95] there had been a graph-based scoring algorithm (Sundheim et al.) which produced unintuitive results for even very simple cases. [Vilain, 95] substituted a model-theoretic scoring algorithm which produced very intuitive results for the type of scoring desired in MUC-6. This algorithm computes the recall error by taking each equivalence class S (defined by the links in the answer key) and determining the number of coreference links m that would have to be added to the response to place all the entities in S into the same equivalence class in the response. Recall error then is the sum of m's divided by the number of links in the key. Precision error is computed by reversing the roles of the answer key and the response.
The full details of the algorithm are discussed next.
The Model Theoretic Approach To The Vilain et al. Algorithm

[1] The exposition of this scorer has been taken nearly entirely from [Vilain, 95].
Figure 1: Truth
Figure 2: Response: Example 1
In the description of the model theoretic algorithm, the terms "key" and "response" are defined in the following way:
key refers to the manually annotated coreference chains (the truth).
response refers to the coreference chains output by a system.
An equivalence set is the transitive closure of a coreference chain. The algorithm computes recall in the following way.
First, let S be an equivalence set generated by the key, and let R_1 ... R_m be equivalence classes generated by the response. Then we define the following functions over S:
• p(S) is a partition of S relative to the response. Each subset of S in the partition is formed by intersecting S and those response sets R_i that overlap S. Note that the equivalence classes defined by the response may include implicit singleton sets - these correspond to elements that are mentioned in the key but not in the response. For example, say the key generates the equivalence class S = {A B C D}, and the response is simply <A-B>. The relative partition p(S) is then {A B}, {C}, and {D}.
• c(S) is the minimal number of "correct" links necessary to generate the equivalence class S. It is clear that c(S) is one less than the cardinality of S, i.e.,

    c(S) = (|S| - 1).
• m(S) is the number of "missing" links in the response relative to the key set S. As noted above, this is the number of links necessary to fully reunite any components of the p(S) partition. We note that this is simply one fewer than the number of elements in the partition, that is,

    m(S) = (|p(S)| - 1).
Looking in isolation at a single equivalence class in the key, the recall error for that class is just the number of missing links divided by the number of correct links, i.e.,

    m(S) / c(S).

Recall in turn is

    (c(S) - m(S)) / c(S),

which equals

    ((|S| - 1) - (|p(S)| - 1)) / (|S| - 1).

The whole expression can now be simplified to

    (|S| - |p(S)|) / (|S| - 1).    (1)
Finally, extending this measure from a single key equivalence class to an entire set T simply requires summing over the key equivalence classes. That is,

    R_T = Σ_i (|S_i| - |p(S_i)|) / Σ_i (|S_i| - 1).    (2)
Precision is computed by switching the roles of the key and response in the above formulation.
Example
For example, let the key contain 3 equivalence classes as shown in Figure 1. Suppose Figure 2 shows a response. From Figure 3(I), the three equivalence classes in the truth, S_1, S_2, and S_3, are {1, 2, 3, 4, 5}, {6, 7}, and {8, 9, A, B, C} respectively. And the partitions p(S_1), p(S_2), and p(S_3), with respect to the response, shown in Figure 3(II), are {1, 2, 3, 4, 5}, {6, 7}, and {8, 9, A, B, C} respectively. Using equation 2, the recall can now be calculated in the following way:

    Recall = ((5 - 1) + (2 - 1) + (5 - 1)) / ((5 - 1) + (2 - 1) + (5 - 1)) = 9/9 = 100%.
Similarly, if the roles of the key and the response are reversed, then the equivalence classes in the truth, S_1 and S_2, are {1, 2, 3, 4, 5} and {6, 7, 8, 9, A, B, C}, and the partitions, p(S_1) and p(S_2), are {1, 2, 3, 4, 5} and [{6, 7}, {8, 9, A, B, C}] respectively (Figure 3(III)). The precision can now be calculated as:

    Precision = ((5 - 1) + (7 - 2)) / ((5 - 1) + (7 - 1)) = 9/10 = 90%.
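The computation above can be sketched in a few lines of Python. This is a minimal illustration of the Vilain et al. scorer, not the official MUC implementation; the input representation (lists of sets of mention ids) and function names are our own:

```python
from fractions import Fraction

def muc_recall(key, response):
    """Vilain et al. recall (equation 2): sum over key classes S of
    |S| - |p(S)|, divided by the sum of |S| - 1.  key and response
    are lists of equivalence classes, each a set of mention ids."""
    numer = denom = 0
    for S in key:
        # partition S by the response; mentions of S that appear in
        # no response class count as implicit singleton parts
        parts = [S & R for R in response if S & R]
        covered = set().union(*parts) if parts else set()
        p_size = len(parts) + len(S - covered)
        numer += len(S) - p_size
        denom += len(S) - 1
    return Fraction(numer, denom)

def muc_precision(key, response):
    # precision reverses the roles of the key and the response
    return muc_recall(response, key)

# The example of Figures 1-3: the truth, and the response that
# merges {6,7} with {8,9,A,B,C}
key = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 'A', 'B', 'C'}]
response = [{1, 2, 3, 4, 5}, {6, 7, 8, 9, 'A', 'B', 'C'}]
```

Run on this example, the sketch reproduces the 9/9 recall and 9/10 precision computed above.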
Shortcomings of the Vilain et al. Algorithm
Despite the advances of the model-theoretic scorer, it yields unintuitive results for some tasks. There are
two main reasons.
1. The algorithm does not give any credit for separating out singletons (entities that occur in chains consisting of only one element, the entity itself) from other chains which have been identified. This follows from the convention in coreference annotation of not identifying those entities that are markable as possibly coreferent with other entities in the text. Rather, entities are only marked as being coreferent if they actually are coreferent with other entities in the text. This potential shortcoming could easily enough be overcome with different annotation conventions and with minor changes to the algorithm, but the decision to annotate singletons is a bit of a philosophical issue. On the one hand, singletons do form equivalence classes, and those equivalence classes are significant in that they are NOT coreferent with another phrase in the text, and they may play an important role in other equivalence classes outside the immediate text (as in cross-document coreference). On the other hand, if coreference is viewed as being about the relations between entities, then perhaps it makes little sense to annotate and score singletons.

[Figure 3: Equivalence Classes and Their Partitions for Example 1. (I) Thin = response, Thick = key; (II) Thin = partition wrt response, Thick = key; (III) Thin = response, Thick = partition wrt key.]
2. All errors are considered to be equal. The MUC scoring algorithm penalizes the precision numbers equally for all types of errors. It is our position that, for certain tasks, some coreference errors do more damage than others.
Consider the following examples: suppose the truth contains two large coreference chains and one small one (Figure 1), and suppose Figures 2 and 4 show two different responses. We will explore two different precision errors. The first error connects one of the large coreference chains with the small one (Figure 2). The second error occurs when the two large coreference chains are related by the errant coreferent link (Figure 4). It is our position that the second error is more damaging because, compared to the first error, the second error makes more entities coreferent that should not be. This distinction is not reflected in the [Vilain, 95] scorer, which scores both responses as having a precision score of 90% (Figure 6).
Revisions to the Algorithm: Our B-CUBED Algorithm

[2] The main idea of this algorithm was initially put forth by Alan W. Biermann of Duke University.

Our B-CUBED algorithm was designed to overcome the two shortcomings of the Vilain et al. algorithm. Instead of looking at the links produced by a system, our algorithm looks at the presence/absence of entities relative to each of the other entities in the equivalence classes produced. Therefore, we compute the precision and recall numbers for each entity in the document, which are then combined to produce final precision and recall numbers for the entire output. The formal model-theoretic version of our algorithm is discussed in the next section.
Figure 4: Response: Example 2
    Precision_i = (number of correct elements in the output chain containing entity_i) / (number of elements in the output chain containing entity_i)

    Recall_i = (number of correct elements in the output chain containing entity_i) / (number of elements in the truth chain containing entity_i)

Figure 5: Definitions for Precision and Recall for an Entity i
For an entity i, we define the precision and recall with respect to that entity in Figure 5.
The final precision and recall numbers are computed by the following two formulae:

    Final Precision = Σ_{i=1..N} w_i * Precision_i

    Final Recall = Σ_{i=1..N} w_i * Recall_i

where N is the number of entities in the document, and w_i is the weight assigned to entity i in the document. It should be noted that the B-CUBED algorithm implicitly overcomes the first shortcoming of the Vilain et al. algorithm by calculating the precision and recall numbers for each entity in the document (irrespective of whether an entity is part of a coreference chain).
Different weighting schemes produce different versions of the algorithm. The choice of the weighting scheme is determined by the task for which the algorithm is going to be used.
When coreference (or cross-document coreference) is used for an information extraction task, where information about every entity in an equivalence class is important, the weighting scheme assigns equal weights to every entity i. For example, the weight assigned to each entity in Figure 1 is 1/12. As shown in Figure 6, the precision scores for the responses in Figures 2 and 4 are 16/21 (76%) and 7/12 (58%) respectively, using equal weights for all entities. Recall for both responses is 100%. It should be noted that the algorithm penalizes the precision numbers more for the error made in Figure 4 than for the one made in Figure 2. As evident from the two examples, this version of the B-CUBED algorithm (using equal weights for each entity) is a precision-oriented algorithm, i.e., it is sensitive to precision errors.
But for an information retrieval (IR) task, or a web search task, where a user is presented with classes of documents that pertain to the same entity, the weighting scheme assigns equal weights to each equivalence class. The weight for each entity within an equivalence class is computed by dividing the weight of the equivalence class by the number of entities in that class. Recall is calculated by assigning equal weights to each equivalence class in the truth, while precision is calculated by assigning equal weights to each equivalence class in the response. For example, in Figure 2, the weighting scheme assigns a weight of 1/10 to each entity in the first equivalence class, and a weight of 1/14 to each entity in the second equivalence class, when calculating precision. Using this weighting scheme, the precision scores for the responses in Figures 2 and 4 are 39/49 (79.6%) and 3/4 (75%) respectively. Recall for both responses is 100%.
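To make the two weighting schemes concrete, here is a minimal Python sketch of the B-CUBED scorer under both weightings. The function names and the input representation (lists of sets of mention ids, with singletons listed explicitly) are our own illustration, not the CAMP implementation:

```python
from fractions import Fraction

def b3_precision(key, response, weighting="entity"):
    """B-CUBED precision under the two weighting schemes: 'entity'
    gives every entity weight 1/N; 'class' gives each response class
    equal weight, split evenly among its members.  key and response
    are lists of sets covering the same mentions."""
    chain_of = {e: c for c in key for e in c}  # truth chain of each entity
    n_entities = sum(len(c) for c in key)
    total = Fraction(0)
    for R in response:
        for e in R:
            correct = len(R & chain_of[e])  # correct elements in e's output chain
            p_e = Fraction(correct, len(R))
            if weighting == "entity":
                w = Fraction(1, n_entities)
            else:
                w = Fraction(1, len(response) * len(R))
            total += w * p_e
    return total

def b3_recall(key, response, weighting="entity"):
    # recall swaps the roles of key and response, so 'class'
    # weighting then distributes weight over the truth classes
    return b3_precision(response, key, weighting)

# Truth (Figure 1) and the two erroneous responses (Figures 2 and 4)
key = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 'A', 'B', 'C'}]
resp1 = [{1, 2, 3, 4, 5}, {6, 7, 8, 9, 'A', 'B', 'C'}]   # small chain merged in
resp2 = [{1, 2, 3, 4, 5, 8, 9, 'A', 'B', 'C'}, {6, 7}]   # two large chains merged
```

On these inputs the sketch reproduces the figures quoted above: 16/21 and 7/12 precision under entity weighting, 39/49 and 3/4 under class weighting, with 100% recall throughout.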
Output      MUC Algorithm    B-CUBED Algorithm (equal weights for every entity)

Example 1   P: 9/10 (90%)    P: 1/12 * [5/5 + 5/5 + 5/5 + 5/5 + 5/5 + 2/7 + 2/7 + 5/7 + 5/7 + 5/7 + 5/7 + 5/7] = 16/21 (76%)
            R: 9/9 (100%)    R: 1/12 * [5/5 + 5/5 + 5/5 + 5/5 + 5/5 + 2/2 + 2/2 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5] = 100%

Example 2   P: 9/10 (90%)    P: 1/12 * [5/10 + 5/10 + 5/10 + 5/10 + 5/10 + 2/2 + 2/2 + 5/10 + 5/10 + 5/10 + 5/10 + 5/10] = 7/12 (58%)
            R: 9/9 (100%)    R: 1/12 * [5/5 + 5/5 + 5/5 + 5/5 + 5/5 + 2/2 + 2/2 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5] = 100%

Figure 6: Scores of Both Algorithms on the Examples
Comparing these numbers to the ones obtained by using the version of the algorithm which assigns equal weights to each entity, one can see that the current version is much less sensitive to precision errors. Although the current version of the algorithm does penalize the precision numbers more for the error in Figure 4 than for the error made in Figure 2, it is less severe than the earlier version.
The Model Theoretic Approach To The B-CUBED Algorithm

Let S be an equivalence set generated by the key, and let R_1 ... R_m be equivalence classes generated by the response. Then we define the following functions over S:
• p(S) is a partition of S with respect to the response, i.e., p(S) is a set of subsets of S formed by intersecting S with those response sets R_i that overlap S. Let p(S) = {P_1, P_2, ..., P_m}, where each P_j is a subset of S.
• m_j(S) is the number of elements that are missing from each P_j relative to the key set S. Therefore,

    m_j(S) = (|S| - |P_j|).
Since the B-CUBED algorithm looks at the presence/absence of entities relative to each of the other entities, the number of missing entities in an entire equivalence set is calculated by adding the number of missing entities with respect to each entity in that equivalence set. Therefore, the number of missing entities for the entire set S is

    Σ_{j=1..m} Σ_{e ∈ P_j} m_j(S).
The recall error is simply the number of missing entities divided by the number of entities in the equivalence set, i.e.,

    m_j(S) / |S|.
Since the algorithm looks at each entity in an equivalence set, the recall error for that entire set is

    (1/|S|) Σ_{j=1..m} Σ_{e ∈ P_j} m_j(S) / |S|.
Recall in turn is

    1 - (1/|S|) Σ_{j=1..m} Σ_{e ∈ P_j} m_j(S) / |S|,
which equals

    1 - Σ_{j=1..m} Σ_{e ∈ P_j} m_j(S) / |S|².
The whole expression can now be simplified to

    1 - Σ_{j=1..m} Σ_{e ∈ P_j} (|S| - |P_j|) / |S|².
Moreover, the measure can be extended from a single key equivalence class to a set T = {S_1, S_2, ..., S_n} of equivalence classes. Therefore, the recall R_i for an equivalence class S_i equals

    R_i = 1 - Σ_{j=1..m} Σ_{e ∈ P_ij} (|S_i| - |P_ij|) / |S_i|²,

where P_ij is the jth element of the partition p(S_i) and, hence, is a subset of S_i.
The recall numbers calculated for each class can now be combined in various ways to produce the final recall. Different versions of the algorithm are obtained by using different combination strategies. If equal weights are assigned to each class, the version of the algorithm produced is exactly the same as the version of the informal algorithm which assigns equal weights to each class, as described in the previous section. In other words, the final recall is an average of the recall numbers for each equivalence class, i.e.,

    R_T = (1/n) Σ_{i=1..n} R_i.
To obtain the version of the informal algorithm which assigns equal weights to each entity, the final recall is computed by calculating the weighted average of the recall numbers for each equivalence class, where the weights are decided by the number of entities in each class, i.e.,

    R_T = Σ_{i=1..n} (|S_i| / Σ_{j=1..n} |S_j|) R_i.
Finally, as in the case of the Vilain et al. algorithm, the precision numbers are calculated by reversing the roles of the key and the response in the above formulation.
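As a sanity check, the simplified per-class expression above agrees with simply averaging the per-entity recalls of Figure 5 over the class. A small Python sketch makes the identity concrete; the key class and response here are our own illustrative example (the response is assumed to cover every entity of the class):

```python
from fractions import Fraction

def recall_closed_form(S, response):
    """R_i = 1 - sum_j sum_{e in P_j} (|S| - |P_j|) / |S|^2
    for one key class S partitioned by the response."""
    parts = [S & R for R in response if S & R]
    # the inner sum over e in P_j has |P_j| identical terms
    err = sum(len(P) * (len(S) - len(P)) for P in parts)
    return 1 - Fraction(err, len(S) ** 2)

def recall_entity_average(S, response):
    """Average over S of the per-entity recalls of Figure 5:
    recall_e = |output chain of e  ∩  S| / |S|."""
    chain_of = {e: R for R in response for e in R}
    return sum(Fraction(len(chain_of[e] & S), len(S)) for e in S) * Fraction(1, len(S))

# an illustrative key class and a response that splits it in two
S = {8, 9, 'A', 'B', 'C'}
response = [{6, 7, 8, 9}, {'A', 'B', 'C'}]
```

Both routes give the same value (13/25 on this example), which is why the closed form can stand in for the entity-by-entity computation.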
Task Relative Strengths and Weaknesses of the Two Algorithms
The Vilain et al. algorithm is useful for applications/tasks that use single coreference relations at a time rather than resulting equivalence classes. For our development in the coreference task, the two algorithms provide distinct perspectives on system performance. Vilain et al. provide a strong diagnostic for errors that reflect pairwise decisions made by the system. Our visual display techniques emphasize just this sort of processing.
Our total score under the Vilain algorithm, with a somewhat fuzzier extent requirement and a stricter requirement for links, is 81% precision and 45% recall. The same files scored with the B3 algorithm resulted in 78% precision and 31% recall. The precision numbers are comparable, which indicates that our goal of high precision is supported under both views of the data. The 14% drop in recall was, however, unexpected. The reason is fairly straightforward: our system is not doing a good job of relating large equivalence classes. This is the converse of penalizing the system for positing incorrect links that result in larger equivalence classes rather than smaller ones.
The drop in recall in the B3 scorer also suggests a distinct class of coreference resolution procedure that we could investigate: growing large equivalence classes via an entity merging model which eschews the standard left-to-right processing strategy of most coreference resolution systems. If such a procedure can reliably grow medium-sized equivalence classes into large ones, then the recall figures will improve under the B3 scorer. The Vilain et al. scorer notes no difference between correctly relating two singleton equivalence classes and correctly relating two large equivalence classes.
Since large equivalence classes tend to include topically significant entities for documents, correctly identifying them is perhaps crucial to applications like summarization and information extraction.
Developing with the Vilain et al. Algorithm
The analysis below reflects how we assessed the individual contributions of the components during development. Since the B3 algorithm was not yet implemented, we did not use it for development.
Our explicit goal was to maximize recall at a precision level of 80%. We feel that this level of precision provides enough accuracy to drive a range of coreference-dependent applications; most important for us was query-sensitive text summarization. Our overall approach was to break down coreference resolution into concrete subprograms that each resolved a limited class of coreference well. Each component could be scored separately by either running it in isolation, or by blocking coreference from subsequent processes. Below we discuss each component in the order of execution.
Genre Specific Coreference
A problematic aspect of any new genre of data is the existence of idiosyncratic classes of coreference, and the MUC-7 data was particularly troubling since very oddly formatted text was fair game for coreference. For example, the strings `HUGHES' and `FCC' in `<SLUG fv=tia-z> BC-HUGHES-FCC-BLOOM </SLUG>' are coreferent with the same strings in `<PREAMBLE>BC-HUGHES-FCC-BLOOM...', which was outside the scope of our linguistic tools. Simple programs were written to recognize this sort of coreference. The performance by the Vilain scorer is 4.2% recall and 67.5% precision.
This performance is well below what we observed in training data: the precision was 85-90% for similarly sized collections. Perhaps part of the problem was that we never quite grasped why some, but not all, of these all-caps strings were coreferent.
La Hack
La Hack is a carry-over from our original MUC-6 system, and it is responsible for identification of proper noun coreference. This component is indirectly helped by IBM's named entity tool 'Textract', which finds extents of named entities in addition to assigning them properties like 'is person' and 'is company'. It is the foundation upon which our coreference annotation is built; mistakes here can be devastating for the rest of the system. In MUC-6, La Hack performed at 29% recall and 86% precision, but it fared somewhat worse in MUC-7, with 24.0% recall and 80.0% precision.
We observed that the New York Times data had far less regular honorific use and corporate designator use than the MUC-6 corpus based on the Wall Street Journal. As a result, there were fewer reliable indicators of proper names.
Highly Syntactic Coreference
This component asserts coreference between phrases that are in appositive relations or that are in predicate nominal relations. We were quite surprised at how poorly this component performed, since we expected performance to be above the 80% precision cutoff. Our actual performance was 3.3% recall and 64.0% precision.
Quoted Speech
Quoted speech has idiosyncratic patterns of use that are better solved outside the scope of our standard coreference resolution module. We expected performance to be above 90% precision and were pleased with 2.6% recall and 86.8% precision. This module is a good example of how the coreference problem can be fruitfully broken up into sub-parts of individually high precision.
CogNIAC Proper Noun Resolution
CogNIAC is the most general purpose coreference resolution component of the system. It features a fairly sophisticated salience model and a property confidence model to preorder the set of candidate antecedents. The importance of the preorder is that it allows ties between equally salient antecedents, and in the case of ties the anaphor is not resolved.
When deficiencies were noted with the output of La Hack, the simplest solution was to add a proper noun resolution component to CogNIAC. In the end this addition added a bit of recall, but with fairly low precision: 1.2% recall and 65.2% precision.
CogNIAC Common Noun Resolution
Common noun coreference is an important part of coreference, but it is very difficult to resolve accurately. Our MUC-6 system had fairly poor performance, with 10% recall and a precision of 48%. We were surprised by an increase in performance over training data (78% precision), with 7.1% recall and 90.7% precision.
Common noun anaphora is probably one of the most trying classes of coreference to annotate as a human. This is due to the many difficult judgment calls required on the part of the human judges, and this was reflected in the consistency of annotation in the training data. We found it challenging to develop on the training data because the system would find what we considered to be reasonable instances of coreference that the annotator had not made coreferent. We believe that common noun anaphora is a large source of inter-annotator disagreement.
CogNIAC Pronouns
The pronominal system performed under our goal of 80% precision. In training, we found that we were constantly balancing the ability of pronouns to i) refer uniquely, and ii) have all entities carry the correct properties. We adopted a property confidence model that encouraged recall over precision. This meant that a proper noun like 'Mrs. Fields' would potentially be an antecedent both to feminine pronouns and to pronouns that referred to companies. A salience model was then applied to these overloaded entities, and pronominal resolution turned out to be a word-sense disambiguation problem in addition to a coreference resolution problem. Our performance was 4.5% recall and 70.0% precision.
Conclusions
One of the stronger conclusions that we have come to regarding coreference is that there is an apparent linear trade-off between precision and recall, given the performance of other systems on the coreference task. Our suspicion is that the same can be said of the B3 scorer, but that will have to await experimentation. This is a positive result in itself, because we can now choose from multiple types of coreference systems depending on our task. We consider high precision systems to be more useful for the types of systems that we build, but it has not been clear that high precision systems were possible.
We also believe that the space of high precision 'contributors' to coreference is not exhausted. We doubt that there are any 10% recall/80% precision subcomponents that we have not already explored, but there are certainly 1-5% recall opportunities. How well they will sum to the recall of the entire system is unknown, but there is room for improvement.

REFERENCES
[Bagga, 98a] Bagga, Amit. How Much Processing Is Required for Cross-Document Coreference?, this volume.
[Vilain, 95] Vilain, Marc, et al. A Model-Theoretic Coreference Scoring Scheme, Proceedings of the Sixth Message Understanding Conference (MUC-6), pp. 45-52, November 1995.
[MUC-6, 95] Proceedings of the Sixth Message Understanding Conference (MUC-6), November 1995, San Mateo: Morgan Kaufmann.
