Combining Sample Selection and Error-Driven Pruning for
Machine Learning of Coreference Rules
Vincent Ng and Claire Cardie
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
{yung,cardie}@cs.cornell.edu
Abstract
Most machine learning solutions to noun
phrase coreference resolution recast the
problem as a classification task. We ex-
amine three potential problems with this
reformulation, namely, skewed class dis-
tributions, the inclusion of “hard” training
instances, and the loss of transitivity in-
herent in the original coreference relation.
We show how these problems can be han-
dled via intelligent sample selection and
error-driven pruning of classification rule-
sets. The resulting system achieves F-
measures of 69.5 and 63.4 on the MUC-
6 and MUC-7 coreference resolution data
sets, respectively, surpassing the perfor-
mance of the best MUC-6 and MUC-7
coreference systems. In particular, the
system outperforms the best-performing
learning-based coreference system to date.
1 Introduction
Noun phrase coreference resolution refers to the
problem of determining which noun phrases (NPs)
refer to each real-world entity mentioned in a doc-
ument. Machine learning approaches to this prob-
lem have been reasonably successful, operating pri-
marily by recasting the problem as a classification
task (e.g. Aone and Bennett (1995), McCarthy and
Lehnert (1995), Soon et al. (2001)). Specifically, an
inductive learning algorithm is used to train a classi-
fier that decides whether or not two NPs in a docu-
ment are coreferent. Training data are typically cre-
ated by relying on coreference chains from the train-
ing documents: training instances are generated by
pairing each NP with each of its preceding NPs; in-
stances are labeled as positive if the two NPs are in
the same coreference chain, and labeled as negative
otherwise.1
A separate clustering mechanism then coordinates
the possibly contradictory pairwise coreference clas-
sification decisions and constructs a partition on the
set of NPs with one cluster for each set of corefer-
ent NPs. Although, in principle, any clustering algo-
rithm can be used, most previous work uses a single-
link clustering algorithm to impose coreference par-
titions.2 An implicit assumption in the choice of the
single-link clustering algorithm is that coreference
resolution is viewed as anaphora resolution, i.e. the
goal during clustering is to find an antecedent for
each anaphoric NP in a document.3
Three intrinsic properties of coreference4, how-
ever, make the formulation of the problem as a
classification-based single-link clustering task po-
tentially undesirable:
Coreference is a rare relation. That is, most
NP pairs in a document are not coreferent. Con-
1Two NPs are in the same coreference chain if and only if
they are coreferent.
2One exception is Kehler’s work on probabilistic corefer-
ence (Kehler, 1997), in which he applies Dempster’s Rule of
Combination (Dempster, 1968) to combine all pairwise proba-
bilities of coreference to form a partition.
3In this paper, we consider an NP anaphoric if it is part of a
coreference chain but is not the head of the chain.
4Here, we use the term coreference loosely to refer to either
the problem or the binary relation defined on a set of NPs. The
particular choice should be clear from the context.
Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP), Philadelphia, July 2002, pp. 55-62.
Association for Computational Linguistics.
sequently, generating training instances by pairing
each NP with each of its preceding NPs creates
highly skewed class distributions, in which the num-
ber of positive instances is overwhelmed by the
number of negative instances. For example, the stan-
dard MUC-6 and MUC-7 (1995; 1998) coreference
data sets contain only 2% positive instances. Un-
fortunately, learning in the presence of such skewed
class distributions remains an open area of research
in the machine learning community (e.g. Pazzani et
al. (1994), Fawcett (1996), Cardie and Howe (1997),
Kubat and Matwin (1997)).
Coreference is a discourse-level problem with dif-
ferent solutions for different types of NPs. The
interpretation of a pronoun, for example, may be de-
pendent only on its closest antecedent and not on the
rest of the members of the same coreference chain.
Proper name resolution, on the other hand, may be
better served by ignoring locality constraints alto-
gether and relying on string-matching or more so-
phisticated aliasing techniques. Consequently, gen-
erating positive instances from all pairs of NPs from
the same coreference chain can potentially make the
learning task harder: all but a few coreference links
derived from any chain might be hard to identify
based on the available contextual cues.
Coreference is an equivalence relation. Recast-
ing the problem as a classification task precludes en-
forcement of the transitivity constraint. After train-
ing, for example, the classifier might determine that
A is coreferent with B, and B with C, but that A and
C are not coreferent. Hence, the clustering mecha-
nism is needed to coordinate these possibly contra-
dictory pairwise classifications. In addition, because
the coreference classifiers are trained independent of
the clustering algorithm to be used, improvements in
classification accuracy do not guarantee correspond-
ing improvements in clustering-level accuracy, i.e.
overall performance on the coreference resolution
task might not improve.
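To make the transitivity conflict concrete, the following toy sketch (ours, not from the paper) shows a union-find implementation of single-link clustering merging A, B, and C into one cluster even when the pairwise classifier judges A and C non-coreferent:

```python
# Toy illustration: single-link clustering restores transitivity that
# pairwise classification may violate. Names and interface are ours.

def single_link(nps, pairwise):
    """Merge NPs into clusters whenever any pair is classified coreferent.

    nps: list of NP identifiers;
    pairwise: list of (np_a, np_b, is_coreferent) classifier decisions.
    """
    parent = {np: np for np in nps}

    def find(x):
        # find the cluster representative, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, coref in pairwise:
        if coref:
            parent[find(a)] = find(b)   # union the two clusters

    clusters = {}
    for np in nps:
        clusters.setdefault(find(np), set()).add(np)
    return list(clusters.values())
```

Even though the (A, C) decision is negative, the chain of positive links forces all three NPs into one cluster, which is exactly the behavior the single-link assumption imposes.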
This paper examines each of the above issues.
First, to address the problem of skewed class dis-
tributions, we apply a technique for negative in-
stance selection similar to that proposed in Soon et
al. (2001). In contrast to results reported there, how-
ever, we show empirically that system performance
increases noticeably in response to negative example
selection, with increases in F-measure of 3-5%.
Second, in an attempt to avoid the inclusion of
“hard” training instances, we present a corpus-based
method for implicit selection of positive instances.
The approach is a fully automated variant of the ex-
ample selection algorithm introduced in Harabagiu
et al. (2001). With positive example selection, sys-
tem performance (F-measure) again increases, by
12-14%.
Finally, to more tightly tie the classification- and
clustering-level coreference decisions, we propose
an error-driven rule pruning algorithm that opti-
mizes the coreference classifier ruleset with respect
to the clustering-level coreference scoring function.
Overall, the use of pruning boosts system perfor-
mance from an F-measure of 69.3 to 69.5, and from
57.2 to 63.4 for the MUC-6 and MUC-7 data sets,
respectively, enabling the system to achieve perfor-
mance that surpasses that of the best MUC corefer-
ence systems by 4.6% and 1.6%. In particular, the
system outperforms the best-performing learning-
based coreference system (Soon et al., 2001) by
6.9% and 3.0%.
The remainder of the paper is organized as fol-
lows. In sections 2 and 3, we present the machine
learning framework underlying the baseline corefer-
ence system and examine the effect of negative sam-
ple selection. Section 4 presents our corpus-based
algorithm for selection of positive instances. Section
5 describes and evaluates the error-driven pruning
algorithm. We conclude with future work in section
6.
2 The Machine Learning Framework for
Coreference Resolution
Our machine learning framework for coreference
resolution is a standard combination of classification
and clustering, as described above.
Creating an instance. An instance in our machine
learning framework is a description of two NPs in a
document. More formally, let NP_i be the i-th NP in
document d. An instance formed from NP_i and NP_j
is denoted by i(NP_i, NP_j). A valid instance is an
instance i(NP_i, NP_j) such that NP_i precedes NP_j.5
Following previous work (Aone and Bennett (1995),
5By definition, exactly n(n-1)/2 valid instances can be created
from n NPs in a given document.
Soon et al. (2001)), we assume throughout the paper
that only valid instances will be generated and used
for training and testing. Each instance consists of 25
features, which are described in Table 1.6 The clas-
sification associated with a training instance is one
of COREFERENT or NOT COREFERENT depending
on whether the NPs co-refer in the associated train-
ing text.7
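The instance-creation step can be sketched as follows; this is a simplified illustration in which instances are just NP pairs (the actual 25-feature vectors are elided) and coreference chains are given as sets of NP indices:

```python
# Hypothetical sketch of training-instance generation: pair each NP with
# each preceding NP; label an instance COREFERENT (True) iff both NPs
# belong to the same coreference chain. Feature extraction is elided.

def generate_instances(nps, chains):
    """nps: list of NPs in document order;
    chains: list of sets of coreferent NP indices."""
    coreferent = {frozenset((a, b)) for chain in chains
                  for a in chain for b in chain if a != b}
    instances = []
    for j in range(len(nps)):
        for i in range(j):              # every valid (preceding) pair
            label = frozenset((i, j)) in coreferent
            instances.append(((nps[i], nps[j]), label))
    return instances
```

For a document with n NPs this produces exactly n(n-1)/2 valid instances, matching the count in footnote 5.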
Building an NP coreference classifier. We use
RIPPER (Cohen, 1995), an information gain-based
propositional rule learning system, to train a classi-
fier that, given a test instance i(NP_i, NP_j), decides
whether or not NP_i and NP_j are coreferent. Specifi-
cally, RIPPER sequentially covers the positive train-
ing instances and induces a ruleset that determines
when two NPs are coreferent. When none of the
rules in the ruleset is applicable to a given NP pair, a
default rule that classifies the pair as not coreferent
is automatically invoked. The output of the classifier
is either COREFERENT or NOT COREFERENT along
with a number between 0 and 1 that indicates the
confidence of the classification.
Applying the classifier to create coreference
chains. After training, the resulting ruleset is used
by a best-first clustering algorithm to impose a par-
titioning on all NPs in the test texts, creating one
cluster for each set of coreferent NPs. Texts are
processed from left to right. Each NP encountered,
NP_j, is compared in turn to each preceding NP, NP_i,
from right to left. For each pair, a test instance is
created as during training and is presented to the
coreference classifier. The NP with the highest con-
fidence value among the preceding NPs that are clas-
sified as being coreferent with NP_j is selected as the
antecedent of NP_j; otherwise, no antecedent is se-
lected for NP_j.
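A minimal sketch of this best-first antecedent selection, assuming a black-box classify function that returns a (coreferent?, confidence) pair for an ordered NP pair (the function name and interface are our own):

```python
# Sketch of the best-first clustering step. `classify` stands in for the
# trained RIPPER ruleset and returns (is_coreferent, confidence).

def resolve(nps, classify):
    """For each NP_j, pick as antecedent the preceding NP classified
    COREFERENT with the highest confidence. Returns {j: antecedent_i}."""
    antecedent = {}
    for j in range(len(nps)):
        best, best_conf = None, -1.0
        for i in range(j - 1, -1, -1):   # right to left over preceding NPs
            coref, conf = classify(nps[i], nps[j])
            if coref and conf > best_conf:
                best, best_conf = i, conf
        if best is not None:
            antecedent[j] = best
    return antecedent
```

Chains are then read off by following antecedent links, which is why this clustering is single-link in effect.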
3 Negative Sample Selection
As noted above, skewed class distributions arise
when generating all valid instances from the train-
ing texts. A number of methods for handling skewed
distributions have been proposed in the machine
learning literature, most of which modify the learning
algorithm to incorporate a loss function with a
much larger penalty for minority class errors than for
instances from the majority classes (e.g. Gordon and
Perlis (1989), Pazzani et al. (1994)).

6See Ng and Cardie (2002) for a detailed description of the
features.
7In all of the work presented here, NPs are identified, and
feature values computed entirely automatically.

Algorithm NEG-SELECT(NEG: set of all possible
negative instances)
  for i(NP_i, NP_j) ∈ NEG do
    if NP_j is anaphoric then
      if NP_i precedes f(NP_j) then
        NEG := NEG − {i(NP_i, NP_j)}
    else
      NEG := NEG − {i(NP_i, NP_j)}
  return NEG

Figure 1: The NEG-SELECT algorithm

We investigate
here a different approach to handling skewed class
distributions — negative sample selection, i.e. the
selection of a smaller subset of negative instances
from the set of available negative instances. In the
case of NP coreference, we hypothesize that reduc-
ing the number of negative instances will improve
recall but potentially reduce precision: intuitively,
the existence of fewer negative instances should al-
low RIPPER to more liberally induce positive rules.
We propose a method for negative sample selection
that, for each anaphoric NP, NP_j, retains only those
negative instances for non-coreferent NPs that lie
between NP_j and its farthest preceding antecedent,
f(NP_j). The algorithm for negative sample selec-
tion, NEG-SELECT, is shown in Figure 1. NEG-
SELECT takes as input the set of all possible neg-
ative instances in the training texts, i.e. the set of
valid instances i(NP_i, NP_j) such that NP_i and NP_j
are not in the same coreference chain.
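Under the simplifying assumption that the NPs of a document are indexed left to right and chains are given as index sets, NEG-SELECT can be realized roughly as follows (helper and variable names are ours):

```python
# Sketch of NEG-SELECT: keep only negative pairs (i, j) where NP_i lies
# between NP_j's farthest preceding antecedent f(NP_j) and NP_j itself.

def farthest_antecedent(j, chains):
    """Return the index of NP_j's farthest (earliest) preceding antecedent,
    or None if NP_j is not anaphoric (not in a chain, or head of its chain)."""
    for chain in chains:
        if j in chain:
            earlier = [i for i in chain if i < j]
            return min(earlier) if earlier else None
    return None

def neg_select(num_nps, chains):
    coreferent = {frozenset((i, j)) for chain in chains
                  for i in chain for j in chain if i < j}
    negatives = []
    for j in range(num_nps):
        f = farthest_antecedent(j, chains)
        if f is None:           # NP_j not anaphoric: drop all its negatives
            continue
        for i in range(f, j):   # only NPs between f(NP_j) and NP_j survive
            if frozenset((i, j)) not in coreferent:
                negatives.append((i, j))
    return negatives
```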
The intuition behind this approach is very simple.
Let A(NP_j) be the set of preceding antecedents of
NP_j, and C(NP_i, NP_j) be the set consisting of NPs
NP_i, NP_(i+1), ..., NP_j. Recall that the goal dur-
ing clustering is to compute, for each NP NP_j, the
set A(NP_j) from which the element with the high-
est confidence is selected as the antecedent of NP_j.
Since (1) A(NP_j) is a subset of C(f(NP_j), NP_j)8 and
8We define C(f(NP_j), NP_j) to be the empty set if f(NP_j)
does not exist (i.e. NP_j is not anaphoric).
Feature Type Feature Description
Lexical PRO STR C if both NPs are pronominal and are the same string; else I.
PN STR C if both NPs are proper names and are the same string; else I.
SOON STR NONPRO C if both NPs are non-pronominal and the string of NP_i matches that of NP_j;
else I.
Grammatical PRONOUN 1 Y if NP_i is a pronoun; else N.
PRONOUN 2 Y if NP_j is a pronoun; else N.
DEMONSTRATIVE 2 Y if NP_j starts with a demonstrative such as “this,” “that,” “these,” or “those;”
else N.
BOTH PROPER NOUNS C if both NPs are proper names; NA if exactly one NP is a proper name; else I.
NUMBER C if the NP pair agree in number; I if they disagree; NA if number information
for one or both NPs cannot be determined.
GENDER C if the NP pair agree in gender; I if they disagree; NA if gender information
for one or both NPs cannot be determined.
ANIMACY C if the NPs match in animacy; else I.
APPOSITIVE C if the NPs are in an appositive relationship; else I.
PREDNOM C if the NPs form a predicate nominal construction; else I.
BINDING I if the NPs violate conditions B or C of the Binding Theory; else C.
CONTRAINDICES I if the NPs cannot be co-indexed based on simple heuristics; else C. For
instance, two non-pronominal NPs separated by a preposition cannot be co-
indexed.
SPAN I if one NP spans the other; else C.
MAXIMALNP I if both NPs have the same maximal NP projection; else C.
SYNTAX I if the NPs have incompatible values for the BINDING, CONTRAINDICES, SPAN
or MAXIMALNP constraints; else C.
INDEFINITE I if NP_j is an indefinite and not appositive; else C.
PRONOUN I if NP_i is a pronoun and NP_j is not; else C.
EMBEDDED 1 Y if NP_i is an embedded noun; else N.
TITLE I if one or both of the NPs is a title; else C.
Semantic WNCLASS C if the NPs have the same WordNet semantic class; I if they don’t; NA if the
semantic class information for one or both NPs cannot be determined.
ALIAS C if one NP is an alias of the other; else I.
Positional SENTNUM Distance between the NPs in terms of the number of sentences.
Others PRO RESOLVE C if NP_j is a pronoun and NP_i is its antecedent according to a naive pronoun
resolution algorithm; else I.
Table 1: Feature Set for the Coreference System. The feature set contains relational and non-relational features. Non-
relational features test some property P of one of the NPs under consideration and take on a value of YES or NO depending on
whether P holds. Relational features test whether some property P holds for the NP pair under consideration and indicate whether
the NPs are COMPATIBLE or INCOMPATIBLE w.r.t. P; a value of NOT APPLICABLE is used when property P does not apply.
(2) NP_j is compared to each preceding NP from
right to left by the clustering algorithm, it follows
that the set of negative instances whose classifica-
tions the classifier needs to determine in order to
compute A(NP_j) is a superset of the set of instances
N(NP_j) formed by pairing NP_j with each of its non-
coreferent preceding NPs in C(f(NP_j), NP_j). Con-
sequently, ∪_j N(NP_j) is the minimal set of (nega-
tive) instances whose classifications will be required
during clustering. In principle, to perform the classi-
fications accurately, the classifier needs to be trained
on the corresponding set of negative instances from
the training set, which is ∪_j N(NP_j), where NP_j
is now the j-th NP in training document d. NEG-
SELECT is designed essentially to compute this set.
Next, we examine the effects of this minimalist ap-
proach to negative sample selection.
Evaluation. We evaluate the coreference system
with negative sample selection on the MUC-6 and
MUC-7 coreference data sets, in each case train-
ing the coreference classifier on the 30 “dry run”
texts and applying the coreference resolution algo-
rithm to the 20–30 “formal evaluation” texts. Re-
sults are shown in rows 1 and 2 of Table 2, where
performance is reported in terms of recall, precision,
and F-measure using the model-theoretic MUC scor-
ing program (Vilain et al., 1995). The Baseline sys-
tem employs no sample selection, i.e. all available
training examples are used. Row 2 shows the per-
formance of the Baseline after incorporating NEG-
SELECT. With negative sample selection, the per-
centage of positive instances rises from 2% to 8%
for the MUC-6 data set and from 2% to 7% for the
MUC-7 data set. For both data sets, we see statis-
tically significant increases in recall and statistically
significant, but much larger, drops in precision.9 The
resulting F-measure scores, however, increase non-
trivially from 52.4 to 55.2 (for MUC-6), and from
41.3 to 46.0 (for MUC-7).10
4 Positive Sample Selection
Since not all of the coreference relationships de-
rived from coreference chains are equally easy to
identify, training a classifier using all possible coref-
erence relationships can potentially lead to the in-
duction of inaccurate rules. Given the observa-
tion that one antecedent is sufficient to resolve an
anaphor, it may be desirable to learn only from easy
positive instances. Similar observations are made
by Harabagiu et al. (2001), who point out that in-
telligent selection of positive instances can poten-
tially minimize the amount of knowledge required
to perform coreference resolution accurately. They
assume that the easiest types of coreference rela-
tionships to resolve are those that occur with high
frequencies in the data. Consequently, they mine
by hand three sets of coreference rules for cov-
ering positive instances from the training data by
finding the coreference knowledge satisfied by the
largest number of anaphor-antecedent pairs. While
the Harabagiu et al. algorithm attempts to mine
easy coreference rules from the data by hand, nei-
ther the rule creation process nor stopping condi-
tions are precisely defined. In addition, a lot of
human intervention is required to derive the rules.
In this section, we describe an automatic positive
sample selection algorithm that coarsely mimics the
Harabagiu et al. algorithm by finding a confident an-
tecedent for each anaphor. Overall, our goal is to
avoid the inclusion of hard training instances by au-
tomating the process of deriving easy coreference
rules from the data.
The Algorithm. The positive sample selection al-
gorithm, POS-SELECT, is shown in Figure 2. It as-
sumes the existence of a rule learner, L, that pro-
duces an ordered set of positive rules. POS-SELECT
9Chi-square statistical significance tests are applied to
changes in recall and precision throughout the paper. Unless
otherwise noted, reported differences are at the 0.05 level or
higher. The chi-square test is not applicable to F-measure.
10The F-measure score computed by the MUC scoring pro-
gram is the harmonic mean of recall and precision, F =
2PR/(P+R).
Algorithm POS-SELECT(L: positive rule learner,
T: set of training instances)
  FinalRuleSet := ∅;
  AnaphorSet := ∅;
  BestRule := NIL;
  repeat
    BestRule := best rule among the ranked set
      of positive rules induced on T using L
    FinalRuleSet := FinalRuleSet ∪ {BestRule}
    // collect anaphors from instances that
    // are correctly covered by BestRule
    for i(NP_i, NP_j) ∈ T do
      if i(NP_i, NP_j) is covered by BestRule and
         class(i(NP_i, NP_j)) = COREFERENT then
        AnaphorSet := AnaphorSet ∪ {NP_j}
    // remove instances associated with the
    // anaphors covered by BestRule
    for i(NP_i, NP_j) ∈ T do
      if NP_j ∈ AnaphorSet then
        T := T − {i(NP_i, NP_j)}
  until L cannot induce any rule for the positives
  return FinalRuleSet

Figure 2: The POS-SELECT algorithm
first uses L to induce a ruleset on the training in-
stances and picks the first rule from the ruleset. For
any training instance i(NP_i, NP_j) correctly covered
by this rule, an antecedent NP_i has been identi-
fied for the anaphor NP_j. As a result, all (positive
and negative) training instances formed with NP_j
as the anaphor are no longer needed and are sub-
sequently removed from the training data.11 The
process is repeated until L cannot induce a rule to
cover the remaining positive instances. The output
of POS-SELECT is a set of positive rules selected
during each iteration of the algorithm. Hence, posi-
tive sample selection in POS-SELECT is implicit in
the sense that it is embedded within the rule induc-
tion process.
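A schematic rendering of POS-SELECT, assuming an abstract learner induce(T) that returns a ranked list of positive rules, each a predicate over instances (this interface is our simplification of RIPPER's role):

```python
# Sketch of POS-SELECT with an abstract rule learner. Each training
# example is an (instance, np_j, label) triple, label True = COREFERENT.

def pos_select(induce, T):
    final_rules = []
    while any(label for _, _, label in T):
        rules = induce(T)
        if not rules:
            break                       # learner can no longer cover positives
        best = rules[0]                 # first (best) rule in the ranked set
        # anaphors whose coreference link is correctly covered by best
        covered = {np_j for inst, np_j, label in T if label and best(inst)}
        if not covered:                 # safeguard against non-covering rules
            break
        final_rules.append(best)
        # drop every instance (positive or negative) for those anaphors
        T = [(inst, np_j, label) for inst, np_j, label in T
             if np_j not in covered]
    return final_rules
```

Note that positive sample selection is implicit here, exactly as in the paper: once an anaphor has one confidently covered antecedent, all of its remaining (harder) instances are discarded.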
Evaluation. Results are shown in rows 3 and 4 of
Table 2. As in the previous experiments, the rule
learner is RIPPER. We run the system twice, first
11We speculate that retaining the negative instances would
hurt performance, but this remains to be verified.
Experiments      Algorithms used                         MUC-6             MUC-7
                                                         R    P    F       R    P    F
Baseline         —                                       40.7 73.5 52.4    27.2 86.3 41.3
Neg-Only         NEG-SELECT                              46.5 67.8 55.2    37.4 59.7 46.0
Pos-Only         POS-SELECT                              53.1 80.8 64.1    41.1 78.0 53.8
Combined         NEG-SELECT+POS-SELECT                   63.4 76.3 69.3    59.5 55.1 57.2
Pruning          NEG-SELECT+POS-SELECT+RULE-SELECT       63.3 76.9 69.5    54.2 76.3 63.4
More Training    NEG-SELECT+POS-SELECT                   64.8 70.6 67.6    60.0 55.7 57.8

Table 2: Effects of sample selection and error-driven pruning.
with POS-SELECT only and then with both POS-
SELECT and NEG-SELECT. With POS-SELECT
only, the system achieves an F-measure of 64.1
(for MUC-6) and 53.8 (for MUC-7). When POS-
SELECT and NEG-SELECT are used in combina-
tion, however, the system achieves an F-measure of
69.3 (for MUC-6) and 57.2 (for MUC-7).
Discussion. The experimental results are largely
consistent with our hypothesis. System performance
improves dramatically with positive sample selec-
tion using POS-SELECT both in the absence and
presence of negative sample selection. Without neg-
ative sample selection, F-measure increases from
52.4 to 64.1 (for MUC-6), and from 41.3 to 53.8 (for
MUC-7). Similarly, with negative sample selection,
F-measure increases from 55.2 to 69.3 (for MUC-6),
and from 46.0 to 57.2 (for MUC-7). In addition, our
results indicate that applying both negative and pos-
itive sample selection leads to better performance
than applying positive sample selection alone: F-
measure increases from 64.1 to 69.3, and from 53.8
to 57.2 for the MUC-6 and MUC-7 data sets, respec-
tively. Nevertheless, reducing the number of neg-
ative instances (via negative sample selection) im-
proves recall but damages precision: we see sta-
tistically significant gains in recall and statistically
significant drops in precision for both data sets. In
particular, precision drops precipitously from 78.0
to 55.1 for the MUC-7 data set. We hypothesize
that POS-SELECT does not guarantee that hard pos-
itive instances will be avoided and that the inclu-
sion of these hard instances is responsible for the
poorer precision of the system. Anaphors that do not
have easy antecedents can never be removed auto-
matically via the induction of new rules using POS-
SELECT. In fact, RIPPER will possibly induce rules
to handle these hard instances as long as such
anaphors occur sufficiently frequently in the data set
relative to the number of negative instances.12 Al-
though it might be beneficial to acquire these rules
at the classification level (according to the learning
algorithm), they can be detrimental to system per-
formance at the clustering level, especially if the
rules cover a large number of examples with a lot
of exceptions. Consequently, it is necessary to know
which rules are worth keeping at the clustering
level rather than at the classification level. We
address this issue in the next section.
5 Pruning the Coreference Ruleset
As noted in the introduction, machine learning ap-
proaches to coreference resolution that rely only on
pairwise NP coreference classifiers will not neces-
sarily enforce the transitivity constraint inherent in
the coreference relation. Although approaches to
coreference resolution that rely only on clustering
could easily enforce transitivity (as in Cardie and
Wagstaff (1999)), they have not performed as well
as state-of-the-art approaches to coreference. In this
section, we propose a method for resolving this con-
flict: we introduce an error-driven rule pruning al-
gorithm that considers rules induced by the coref-
erence classifier and discards those that cause the
ruleset to perform poorly with respect to the global,
clustering-level coreference scoring function.
The Algorithm. The error-driven pruning algo-
rithm is inspired by the backward elimination al-
gorithm commonly used for feature selection (see
Blum and Langley (1997)) and is shown in Figure
3. The algorithm, RULE-SELECT, takes as input a
ruleset learned from a training corpus for perform-
ing coreference resolution, a pruning corpus (dis-
joint from the training corpus), and a clustering-level
12More precisely, RIPPER will induce a new rule if the rule
is more than 50% accurate and the resulting description length
is fewer than 64 bits larger than the smallest description length
obtained so far.
Algorithm RULE-SELECT(R: ruleset,
P: pruning corpus,
S: scoring function)
  BestScore := score of the coreference system
    using R on P w.r.t. S;
  r := NIL;
  repeat
    r := the rule in R whose removal yields a
      ruleset with which the coreference system
      achieves the best score b on P w.r.t. S.
    if b > BestScore then
      BestScore := b;
      R := R − {r}
    else break
  while true
  return R

Figure 3: The RULE-SELECT algorithm
coreference scoring function that is the same as the
one being used for evaluating the final output of the
system.13 At each iteration, RULE-SELECT greed-
ily discards the rule whose removal yields a rule-
set with which the coreference system performs the
best (with respect to the coreference scoring func-
tion) on the pruning corpus. As a hill-climbing pro-
cedure, the algorithm terminates when removal of
any of the rules in the ruleset fails to improve per-
formance. In contrast to most existing algorithms for
coreference resolution, RULE-SELECT establishes
a tighter connection between the classification- and
clustering-level decisions for coreference resolution
and ensures that system performance is optimized
with respect to the coreference scoring function. We
hypothesize that this optimization of the coreference
classifier will improve performance of the resulting
coreference system, in particular by increasing its
precision.
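The pruning loop can be sketched as follows, with the clustering-level scorer treated as an opaque callable, mirroring the assumption that RULE-SELECT knows nothing of the scoring function's internals (names are ours):

```python
# Sketch of RULE-SELECT: greedy backward elimination of rules, scored by
# a clustering-level function score(ruleset, pruning_corpus) -> float
# (e.g. MUC F-measure), treated as a black box.

def rule_select(rules, pruning_corpus, score):
    rules = list(rules)
    best_score = score(rules, pruning_corpus)
    while True:
        # try removing each rule in turn; keep the removal that helps most
        candidates = [(score(rules[:k] + rules[k + 1:], pruning_corpus), k)
                      for k in range(len(rules))]
        if not candidates:
            break
        b, k = max(candidates)
        if b > best_score:              # hill climbing: accept improvements only
            best_score = b
            del rules[k]
        else:
            break                       # no single removal helps; terminate
    return rules
```

Each iteration re-scores the full system once per candidate rule, so the cost is quadratic in the ruleset size; for the small rulesets RIPPER produces this is unproblematic.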
Evaluation and Discussion. Results are shown in
row 5 of Table 2. In the Pruning experiment, the
MUC-7 formal evaluation corpus is the pruning cor-
pus for the MUC-6 run; the MUC-6 formal evalu-
ation corpus is the pruning corpus for the MUC-7
13Importantly, RULE-SELECT assumes no knowledge of the
inner workings of the scoring function.
run. In addition, the quantity that RULE-SELECT
optimizes for a given ruleset is the F-measure re-
turned by the MUC scoring function.14 In compar-
ison to the Combined results, we see an improve-
ment of 0.2% (for MUC-6) and 6.2% (for MUC-7)
in F-measure. In particular, we see statistically sig-
nificant gains in precision (from 55.1 to 76.3) and
statistically significant, but much smaller, drops in
recall (from 59.5 to 54.2) for the MUC-7 data set.
In general, our results support the hypothesis that
rule pruning can be used to improve system perfor-
mance; moreover, the technique is especially effec-
tive at enhancing the precision of the system. How-
ever, performance gains may be negligible when
pruning is used in systems with high precision, as
can be seen from the results for the MUC-6 data set.
To determine whether performance improvements
are instead attributable to the availability of addi-
tional “training” data provided by the pruning cor-
pus, we train a classifier (using the same setting as
the Combined experiments) on both the training and
the pruning corpora. The performance of the system
using this unpruned ruleset is shown in the last row
of Table 2. In comparison to the Combined results,
F-measure drops from 69.3 to 67.6 (for MUC-6),
and rises from 57.2 to 57.8 (for MUC-7). These re-
sults indicate that the RULE-SELECT algorithm has
made more effective use of the additional data than
the learning algorithm without rule pruning, by ex-
ploiting the feedback provided by the scoring func-
tion.
6 Conclusions
We have examined three problems with recasting
noun phrase coreference resolution as a classifica-
tion task. To handle these problems, we presented
a minimalist negative sample selection algorithm to
reduce the skewness of the class distributions, and
an automatic positive sample selection algorithm to
select easy positive instances. In addition, our ex-
periments indicate that the positive sample selection
algorithm does not guarantee that hard instances can
be entirely excluded. As a result, we proposed an
error-driven rule pruning algorithm that can effec-
tively enhance the precision of the system by dis-
14RULE-SELECT can be used in conjunction with any coref-
erence scoring function. The MUC scorer is chosen here to fa-
cilitate comparison with previous results.
carding rules that cause the ruleset to perform poorly
with respect to the coreference scoring function.
The resulting system outperformed the best MUC-
6 and MUC-7 coreference systems as well as the
best-performing learning-based system on the cor-
responding MUC data sets. Nevertheless, there is
substantial room for improvement. For example, it
is important to know how sensitive system perfor-
mance is to the size of the pruning corpus. In
addition, although we use RIPPER as the un-
derlying learning algorithm in our coreference sys-
tem, we expect that the techniques described in this
paper can be used in conjunction with other learn-
ing algorithms. We plan to explore this possibility
in future work.
Acknowledgments
We thank two anonymous reviewers for their help-
ful comments and Hwee Tou Ng for explaining the
method of training instance selection employed by
his coreference system. This work was supported
in part by DARPA TIDES contract N66001-00-C-
8009, and NSF Grants 0081334 and 0074896.

References
C. Aone and S. W. Bennett. 1995. Evaluating Auto-
mated and Manual Acquisition of Anaphora Resolu-
tion Strategies. In Proceedings of the 33rd Annual
Meeting of the Association for Computational Linguis-
tics, pages 122–129.
A. Blum and P. Langley. 1997. Selection of relevant
features and examples in machine learning. Artificial
Intelligence, 97:245–271.
C. Cardie and N. Howe. 1997. Improving minority class
prediction using case-specific feature weights. In Pro-
ceedings of the Fourteenth International Conference
on Machine Learning, pages 57–65.
C. Cardie and K. Wagstaff. 1999. Noun Phrase Coref-
erence as Clustering. In Joint SIGDAT Conference on
Empirical Methods in Natural Language Processing
and Very Large Corpora (EMNLP/VLC-99), pages 82–
89.
W. Cohen. 1995. Fast Effective Rule Induction. In Pro-
ceedings of the Twelfth International Conference on
Machine Learning, San Francisco, CA.
A. Dempster. 1968. A generalization of Bayesian infer-
ence. Journal of the Royal Statistical Society, 30:205–
247.
T. Fawcett. 1996. Learning with skewed class distri-
butions — summary of responses. Machine Learning
List: Vol. 8, No. 20.
D. F. Gordon and D. Perlis. 1989. Explicitly biased gen-
eralization. Computational Intelligence, 5:67–81.
S. Harabagiu, R. Bunescu, and S. Maiorano. 2001. Text
and Knowledge Mining for Coreference Resolution.
In Proceedings of the Second Meeting of the North
American Chapter of the Association for Computational
Linguistics (NAACL-2001), pages 55–62.
A. Kehler. 1997. Probabilistic Coreference in Informa-
tion Extraction. In Proceedings of the Second Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, pages 163–173.
M. Kubat and S. Matwin. 1997. Addressing the Curse
of Imbalanced Training Sets: One-Sided Selection. In
Proceedings of the 14th International Conference on
Machine Learning (ICML-97), pages 179–186.
J. McCarthy and W. Lehnert. 1995. Using Decision
Trees for Coreference Resolution. In Proceedings of
the Fourteenth International Joint Conference on Artificial
Intelligence, pages 1050–1055.
MUC-6. 1995. Proceedings of the Sixth Message Under-
standing Conference (MUC-6). Morgan Kaufmann,
San Francisco, CA.
MUC-7. 1998. Proceedings of the Seventh Message
Understanding Conference (MUC-7). Morgan Kauf-
mann, San Francisco, CA.
V. Ng and C. Cardie. 2002. Improving machine learn-
ing approaches to coreference resolution. In Proceed-
ings of the 40th Annual Meeting of the Association for
Computational Linguistics. Association for Computa-
tional Linguistics.
M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, and
C. Brunk. 1994. Reducing Misclassification Costs. In
Proceedings of the Eleventh International Conference
on Machine Learning, pages 217–225.
W. M. Soon, H. T. Ng, and D. C. Y. Lim. 2001. A
Machine Learning Approach to Coreference Resolu-
tion of Noun Phrases. Computational Linguistics,
27(4):521–544.
M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and
L. Hirschman. 1995. A model-theoretic coreference
scoring scheme. In Proceedings of the Sixth Mes-
sage Understanding Conference (MUC-6), pages 45–
52. Morgan Kaufmann.
