Noun Phrase Coreference as Clustering 
Claire Cardie and Kiri Wagstaff 
Department of Computer Science 
Cornell University 
Ithaca, NY 14853 
E-mail: cardie,wkiri@cs.cornell.edu 
Abstract 
This paper introduces a new, unsupervised algorithm for noun phrase
coreference resolution. It differs from existing methods in that it
views coreference resolution as a clustering task. In an evaluation on
the MUC-6 coreference resolution corpus, the algorithm achieves an
F-measure of 53.6%, placing it firmly between the worst (40%) and best
(65%) systems in the MUC-6 evaluation. More importantly, the clustering
approach outperforms the only MUC-6 system to treat coreference
resolution as a learning problem. The clustering algorithm appears to
provide a flexible mechanism for coordinating the application of
context-independent and context-dependent constraints and preferences
for accurate partitioning of noun phrases into coreference equivalence
classes.
1 Introduction 
Many natural language processing (NLP) applica- 
tions require accurate noun phrase coreference reso- 
lution: They require a means for determining which 
noun phrases in a text or dialogue refer to the same 
real-world entity. The vast majority of algorithms 
for noun phrase coreference combine syntactic and, 
less often, semantic cues via a set of hand-crafted 
heuristics and filters. All but one system in the 
MUC-6 coreference performance evaluation (MUC, 
1995), for example, handled coreference resolution 
in this manner. This same reliance on complicated 
hand-crafted algorithms is true even for the narrower 
task of pronoun resolution. Some exceptions exist, 
however. Ge et al. (1998) present a probabilistic 
model for pronoun resolution trained on a small sub- 
set of the Penn Treebank Wall Street Journal corpus 
(Marcus et al., 1993). Dagan and Itai (1991) develop 
a statistical filter for resolution of the pronoun "it" 
that selects among syntactically viable antecedents 
based on relevant subject-verb-object cooccurrences. 
Aone and Bennett (1995) and McCarthy and Lehn- 
ert (1995) employ decision tree algorithms to handle 
a broader subset of general noun phrase coreference 
problems. 
This paper presents a new corpus-based approach 
to noun phrase coreference. We believe that it 
is the first such unsupervised technique developed 
for the general noun phrase coreference task. In 
short, we view the task of noun phrase coreference 
resolution as a clustering task. First, each noun 
phrase in a document is represented as a vector 
of attribute-value pairs. Given the feature vector 
for each noun phrase, the clustering algorithm coor- 
dinates the application of context-independent and 
context-dependent coreference constraints and pref- 
erences to partition the noun phrases into equiv- 
alence classes, one class for each real-world entity 
mentioned in the text. Context-independent corefer- 
ence constraints and preferences are those that apply 
to two noun phrases in isolation. Context-dependent 
coreference decisions, on the other hand, consider 
the relationship of each noun phrase to surrounding 
noun phrases. 
In an evaluation on the MUC-6 coreference reso- 
lution corpus, our clustering approach achieves an 
F-measure of 53.6%, placing it firmly between the 
worst (40%) and best (65%) systems in the MUC- 
6 evaluation. More importantly, the clustering ap- 
proach outperforms the only MUC-6 system to view 
coreference resolution as a learning problem: The 
RESOLVE system (McCarthy and Lehnert, 1995) 
employs decision tree induction and achieves an F- 
measure of 47% on the MUC-6 data set. Further- 
more, our approach has a number of important 
advantages over existing learning and non-learning 
methods for coreference resolution: 
• The approach is largely unsupervised, so no an- 
notated training corpus is required. 
• Although evaluated in an information ex- 
traction context, the approach is domain- 
independent. 
• As noted above, the clustering approach pro- 
vides a flexible mechanism for coordinat- 
ing context-independent and context-dependent 
coreference constraints and preferences for par- 
titioning noun phrases into coreference equiva- 
lence classes. 
As a result, we believe that viewing noun phrase 
coreference as clustering provides a promising frame- 
work for corpus-based coreference resolution. 
The remainder of the paper describes the details of 
our approach. The next section provides a concrete 
specification of the noun phrase coreference resolu- 
tion task. Section 3 presents the clustering algo- 
rithm. Evaluation of the approach appears in Sec- 
tion 4. Qualitative and quantitative comparisons to 
related work are included in Section 5. 
2 Noun Phrase Coreference 
It is commonly observed that a human speaker or 
author avoids repetition by using a variety of noun 
phrases to refer to the same entity. While human
audiences have little trouble mapping a collection 
of noun phrases onto the same entity, this task of 
noun phrase (NP) coreference resolution can present 
a formidable challenge to an NLP system. Figure 1
depicts a typical coreference resolution system,
which takes as input an arbitrary document and pro- 
duces as output the appropriate coreference equiva- 
lence classes. The subscripted noun phrases in the 
sample output constitute two noun phrase corefer- 
ence equivalence classes: Class JS contains the five 
noun phrases that refer to John Simon, and class 
PC contains the two noun phrases that represent 
Prime Corp. The figure also visually links neigh- 
boring coreferent noun phrases. The remaining (un- 
bracketed) noun phrases have no coreferent NPs and 
are considered singleton equivalence classes. Han- 
dling the JS class alone requires recognizing corefer- 
ent NPs in appositive and genitive constructions as 
well as those that occur as proper names, possessive 
pronouns, and definite NPs. 
3 Coreference as Clustering 
Our approach to the coreference task stems from 
the observation that each group of coreferent noun 
phrases defines an equivalence class (footnote 1). Therefore, it
is natural to view the problem as one of partitioning, 
or clustering, the noun phrases. Intuitively, all of the 
noun phrases used to describe a specific concept will 
be "near" or related in some way, i.e. their concep- 
tual "distance" will be small. Given a description 
of each noun phrase and a method for measuring 
the distance between two noun phrases, a cluster- 
ing algorithm can then group noun phrases together: 
Noun phrases with distance greater than a cluster- 
ing radius r are not placed into the same partition 
and so are not considered coreferent. 
The subsections below describe the noun phrase representation, the
distance metric, and the clustering algorithm in turn.
Footnote 1: The coreference relation is symmetric, transitive, and
reflexive.
Input:
John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his
pay jump 20%, to $1.3 million, as the 37-year-old also became the
financial-services company's president.
Output:
[JS John Simon], [JS Chief Financial Officer] of [PC Prime Corp.] since
1986, saw [JS his] pay jump 20%, to $1.3 million, as [JS the
37-year-old] also became [PC the financial-services company]'s
[JS president].
Figure 1: Coreference System
3.1 Instance Representation 
Given an input text, we first use the Empire noun 
phrase finder (Cardie and Pierce, 1998) to locate 
all noun phrases in the text. Note that Empire 
identifies only base noun phrases, i.e. simple noun 
phrases that contain no other smaller noun phrases 
within them. For example, Chief Financial Officer 
of Prime Corp. is too complex to be a base noun 
phrase. It contains two base noun phrases: Chief 
Financial Officer and Prime Corp. 
Each noun phrase in the input text is then repre- 
sented as a set of 11 features as shown in Table 1. 
This noun phrase representation is a first approxi- 
mation to the feature vector that would be required 
for accurate coreference resolution. All feature val- 
ues are automatically generated and, therefore, are 
not always perfect. In particular, we use very sim- 
ple heuristics to approximate the behavior of more 
complex feature value computations: 
Individual Words. The words contained in the 
noun phrase are stored as a feature. 
Head noun. The last word in the noun phrase is 
considered the head noun. 
Position. Noun phrases are numbered sequentially, 
starting at the beginning of the document. 
Pronoun Type. Pronouns are marked as one of
NOMinative, ACCusative, POSSessive, or AMBiguous
(you and it). All other noun phrases obtain the value
NONE for this feature.
Words (head noun = last word)   Posi-  Pronoun  Article  Appos-  Number  Proper  Semantic  Gender  Animacy
                                tion   Type              itive           Name    Class
John Simon                      1      NONE     NONE     NO      SING    YES     HUMAN     MASC    ANIM
Chief Financial Officer         2      NONE     NONE     NO      SING    NO      HUMAN     EITHER  ANIM
Prime Corp.                     3      NONE     NONE     NO      SING    NO      COMPANY   NEUTER  INANIM
1986                            4      NONE     NONE     NO      PLURAL  NO      NUMBER    NEUTER  INANIM
his                             5      POSS     NONE     NO      SING    NO      HUMAN     MASC    ANIM
pay                             6      NONE     NONE     NO      SING    NO      PAYMENT   NEUTER  INANIM
20%                             7      NONE     NONE     NO      PLURAL  NO      PERCENT   NEUTER  INANIM
$1.3 million                    8      NONE     NONE     NO      PLURAL  NO      MONEY     NEUTER  INANIM
the 37-year-old                 9      NONE     DEF      NO      SING    NO      HUMAN     EITHER  ANIM
the financial-services company  10     NONE     DEF      NO      SING    NO      COMPANY   NEUTER  INANIM
president                       11     NONE     NONE     NO      SING    NO      HUMAN     EITHER  ANIM
Table 1: Noun Phrase Instance Representation For All Base NPs in the Sample Text
Article. Each noun phrase is marked INDEFinite 
(contains a or an), DEFinite (contains the), or NONE. 
Appositive. Here we use a simple, overly restric- 
tive heuristic to determine whether or not the noun 
phrase is in a (post-posed) appositive construction: 
If the noun phrase is surrounded by commas, con- 
tains an article, and is immediately preceded by an- 
other noun phrase, then it is marked as an apposi- 
tive. 
Number. If the head noun ends in an 's', then 
the noun phrase is marked PLURAL; otherwise, it is 
considered SINGular. Expressions denoting money, 
numbers, or percentages are also marked as PLURAL. 
Proper Name. Proper names are identified by 
looking for two adjacent capitalized words, option- 
ally containing a middle initial. 
Semantic Class. Here we use WordNet (Fellbaum, 
1998) to obtain coarse semantic information for the 
head noun. The head noun is characterized as one 
of TIME, CITY, ANIMAL, HUMAN, or OBJECT. If none
of these classes pertains to the head noun, its imme- 
diate parent in the class hierarchy is returned as the 
semantic class, e.g. PAYMENT for the head noun pay 
in NP6 of Table 1. A separate algorithm identifies 
NUMBERS, MONEY, and COMPANYs. 
Gender. Gender (MASCuline, FEMinine, EITHER,
or NEUTER) is determined using WordNet and (for 
proper names) a list of common first names. 
Animacy. Noun phrases classified as HUMAN or AN- 
IMAL are marked ANIM; all other NPs are considered 
INANIM. 
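The surface heuristics above are simple enough to sketch in a few lines. The Python fragment below is a minimal illustration under stated assumptions, not the implementation used in the system: the function names, the pre-tokenized word lists, and the regular expression used to spot money, numbers, and percentages are all introduced here for illustration. The WordNet-based features (semantic class, gender) and the proper name and appositive heuristics are left out.

```python
import re

# Assumed pattern for tokens such as "1986", "20%", or "$1.3".
NUMERIC_TOKEN = re.compile(r"\$?\d[\d.,]*%?")


def head_noun(words):
    """Head noun heuristic: the last word of the noun phrase."""
    return words[-1]


def article(words):
    """INDEF if the NP contains 'a' or 'an', DEF if it contains 'the'."""
    lowered = {w.lower() for w in words}
    if lowered & {"a", "an"}:
        return "INDEF"
    if "the" in lowered:
        return "DEF"
    return "NONE"


def number(words):
    """PLURAL for numeric expressions or a head ending in 's', else SING."""
    if any(NUMERIC_TOKEN.fullmatch(w) for w in words):
        return "PLURAL"
    return "PLURAL" if head_noun(words).lower().endswith("s") else "SING"


def animacy(semantic_class):
    """ANIM for HUMAN or ANIMAL semantic classes, INANIM otherwise."""
    return "ANIM" if semantic_class in {"HUMAN", "ANIMAL"} else "INANIM"


np_words = ["the", "financial-services", "company"]
print(head_noun(np_words), article(np_words), number(np_words),
      animacy("COMPANY"))
```

Applied literally, the "ends in 's'" rule would mark the pronoun his as PLURAL, whereas Table 1 lists it as SING, so a fuller version would presumably need to treat pronouns separately.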
3.2 Distance Metric 
Next, we define the following distance metric be- 
tween two noun phrases: 
$$ dist(NP_i, NP_j) = \sum_{f \in F} w_f \cdot incompatibility_f(NP_i, NP_j) $$
where F corresponds to the NP feature set described above;
incompatibility_f is a function that returns a value between 0 and 1
inclusive and indicates the degree of incompatibility of f for NPi and
NPj; and w_f denotes the relative importance of compatibility w.r.t.
feature f. The incompatibility functions and corresponding weights are
listed in Table 2 (footnote 2). In general, weights are chosen to
represent linguistic knowledge about coreference. Terms with a weight
of ∞ represent filters that rule out impossible antecedents: Two noun
phrases can never corefer when they have incompatible values for that
term's feature. In the current version of our system, the NUMBER,
PROPER NAME, SEMANTIC CLASS, GENDER, and ANIMACY features operate as
coreference filters. Conversely, terms with a weight of −∞ force
coreference between two noun phrases with compatible values for that
term's feature. The APPOSITIVE and WORDS-SUBSTRING terms operate in
this fashion in the current distance metric.
Terms with a weight of r -- the clustering ra- 
dius threshold -- implement a preference that two 
NPs not be coreferent if they are incompatible w.r.t. 
that term's feature. As will be explained below, 
however, two such NPs can be merged into the 
same equivalence class by the clustering algorithm 
if there is enough other evidence that they are sim- 
ilar (i.e. there are other, coreferent noun phrase(s) 
sufficiently close to both). 
All other terms obtain weights selected using the development corpus.
Although additional testing is required, our current results indicate
that these weights are sensitive to the distance metric, but probably
not to the corpus.
Footnote 2: Note that there is not currently a one-to-one
correspondence between NP features and distance metric terms: The
distance metric contains two terms that make use of the WORDS feature
of the noun phrase representation.
Feature f        Weight  Incompatibility function
Words            10.0    (# of mismatching words (a)) / (# of words in longer NP)
Head Noun        1.0     1 if the head nouns differ; else 0
Position         5.0     (difference in position) / (maximum difference in document)
Pronoun          r       1 if NPi is a pronoun and NPj is not; else 0
Article          r       1 if NPj is indefinite and not appositive; else 0
Words-Substring  −∞      1 if NPi subsumes (entirely includes as a substring) NPj; else 0
Appositive       −∞      1 if NPj is appositive and NPi is its immediate predecessor; else 0
Number           ∞       1 if they do not match in number; else 0
Proper Name      ∞       1 if both are proper names, but mismatch on every word; else 0
Semantic Class   ∞       1 if they do not match in class; else 0
Gender           ∞       1 if they do not match in gender (allows EITHER to match MASC or FEM); else 0
Animacy          ∞       1 if they do not match in animacy; else 0
(a) Pronouns are handled as gender-specific "wild cards".
Table 2: Incompatibility Functions and Weights for Each Term in the Distance Metric
Words, Head Noun  Posi-  Pronoun  Article  Appos-  Number  Proper  Class  Gender  Animacy
                  tion                     itive           Name
The chairman      1      NONE     DEF      NO      SING    NO      HUMAN  EITHER  ANIM
Ms. White         2      NONE     NONE     NO      SING    YES     HUMAN  FEM     ANIM
He                3      NOM      NONE     NO      SING    NO      HUMAN  MASC    ANIM
Table 3: Instance Representation for Noun Phrases in "The chairman spoke with Ms. White yesterday. He..."
When computing a sum that involves both ∞ and −∞, we choose the more
conservative route, and the ∞ distance takes precedence (i.e. the two
noun phrases are not considered coreferent). An example of where this
might occur is in the following sentence:
[1 Reardon Steel Co.] manufactures several
thousand tons of [2 steel] each week.
Here, NP1 subsumes NP2, giving them a distance of −∞ via the word
substring term; however, NP1's semantic class is COMPANY, and NP2's
class is OBJECT, generating a distance of ∞ via the semantic class
feature. Therefore, dist(NP1,NP2) = ∞ and the two noun phrases are not
considered coreferent.
The coreference distance metric is largely context-independent in that
it determines the distance between two noun phrases using very little,
if any, of their intervening or surrounding context. The clustering
algorithm described below is responsible for coordinating these local
coreference decisions across arbitrarily long contexts and, thus,
implements a series of context-dependent coreference decisions.
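The weighted sum and the rule that ∞ takes precedence over −∞ can be sketched as follows. This is an illustrative reading of Table 2, not the system's code: the dictionary layout, the helper names, the word-level reading of "subsumes", the way mismatching words are counted, and passing the maximum position difference as a parameter are all assumptions made here.

```python
import math

INF = math.inf


def make_np(text, position, **features):
    """Build a hypothetical NP record; unspecified features get defaults."""
    np = dict(words=text.split(), position=position, pronoun="NONE",
              article="NONE", appositive=False, number="SING", proper=False,
              sem_class="OBJECT", gender="NEUTER", animacy="INANIM")
    np.update(features)
    return np


def _is_pronoun(np):
    return np["pronoun"] != "NONE"


def _lower(np):
    return [w.lower() for w in np["words"]]


def _subsumes(outer, inner):
    """True if inner's words occur as a contiguous run in outer's words."""
    o, i = _lower(outer), _lower(inner)
    return any(o[k:k + len(i)] == i for k in range(len(o) - len(i) + 1))


def _words_mismatch(a, b):
    if _is_pronoun(a) or _is_pronoun(b):      # pronouns act as wild cards
        return 0.0
    longer, shorter = sorted((_lower(a), _lower(b)), key=len, reverse=True)
    return sum(w not in shorter for w in longer) / len(longer)


def _head_differs(a, b):
    if _is_pronoun(a) or _is_pronoun(b):
        return 0.0
    return float(_lower(a)[-1] != _lower(b)[-1])


def _gender_mismatch(a, b):
    ga, gb = a["gender"], b["gender"]
    either_ok = "EITHER" in (ga, gb) and {ga, gb} <= {"EITHER", "MASC", "FEM"}
    return 0.0 if ga == gb or either_ok else 1.0


def _proper_mismatch(a, b):
    both = a["proper"] and b["proper"]
    return float(both and not set(_lower(a)) & set(_lower(b)))


def dist(np_i, np_j, r, max_pos_diff):
    """Distance between NP_i and a later NP_j, following Table 2."""
    gap = np_j["position"] - np_i["position"]
    terms = [
        (10.0, _words_mismatch(np_i, np_j)),
        (1.0, _head_differs(np_i, np_j)),
        (5.0, abs(gap) / max_pos_diff),
        (r, float(_is_pronoun(np_i) and not _is_pronoun(np_j))),
        (r, float(np_j["article"] == "INDEF" and not np_j["appositive"])),
        (-INF, float(_subsumes(np_i, np_j))),
        (-INF, float(np_j["appositive"] and gap == 1)),
        (INF, float(np_i["number"] != np_j["number"])),
        (INF, _proper_mismatch(np_i, np_j)),
        (INF, float(np_i["sem_class"] != np_j["sem_class"])),
        (INF, _gender_mismatch(np_i, np_j)),
        (INF, float(np_i["animacy"] != np_j["animacy"])),
    ]
    fired = [(w, v) for w, v in terms if v > 0]   # avoids inf * 0
    weights = [w for w, _ in fired]
    if INF in weights:          # filters win: never coreferent
        return INF
    if -INF in weights:         # otherwise forced coreference
        return -INF
    return sum(w * v for w, v in fired)


reardon = make_np("Reardon Steel Co.", 1, sem_class="COMPANY")
steel = make_np("steel", 2)
print(dist(reardon, steel, r=4, max_pos_diff=50))
```

Terms whose incompatibility value is 0 are dropped before summing, since multiplying an infinite weight by zero is undefined in floating point; this is a design choice of the sketch rather than something the paper specifies.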
3.3 Clustering Algorithm 
The clustering algorithm is given in Figure 2. Be- 
cause noun phrases generally refer to noun phrases 
that precede them, we start at the end of the doc- 
ument and work backwards. Each noun phrase is 
compared to all preceding noun phrases. If the dis- 
tance between two noun phrases is less than the 
clustering radius r, then their classes are considered 
for possible merging. Two coreference equivalence 
classes can be merged unless there exist any incom- 
patible NPs in the classes to be merged. 
It is useful to consider the application of our algorithm to an
excerpt from a document:
[1 The chairman] spoke with [2 Ms. White]
yesterday. [3 He] ...
The noun phrase instances for this fragment are shown in Table 3.
Initially, NP1, NP2, and NP3 are all singletons and belong to
coreference classes c1, c2, and c3, respectively. We begin by
considering NP3. Dist(NP2,NP3) = ∞ due to a mismatch on gender, so they
are not considered for possible merging. Next, we calculate the
distance from NP1 to NP3. Pronouns are not expected to match when the
words of two noun phrases are compared, so there is no penalty here for
word (or head noun) mismatches. The penalty for their difference in
position is dependent on the length of the document. For illustration,
assume that this is less than r. Thus, dist(NP1,NP3) < r. Their
coreference classes, c1 and c3, are then considered for merging.
Because they are singleton classes, there is no additional possibility
for conflict, and both noun phrases are merged into c1.
COREFERENCE_CLUSTERING (NPn, NPn-1, ..., NP1)
1. Let r be the clustering radius.
2. Mark each noun phrase NPi as belonging to its own
   class, ci: ci = {NPi}.
3. Proceed through the noun phrases from the document
   in reverse order, NPn, NPn-1, ..., NP1. For each noun
   phrase NPj encountered, consider each preceding noun
   phrase NPi.
   (a) Let d = dist(NPi, NPj).
   (b) Let ci = class_of NPi and cj = class_of NPj.
   (c) If d < r and ALL_NPs_COMPATIBLE (ci, cj)
       then cj = ci ∪ cj.
ALL_NPs_COMPATIBLE (ci, cj)
1. For all NPa ∈ cj
   (a) For all NPb ∈ ci
       i. If dist(NPa, NPb) = ∞
          then Return FALSE.
2. Return TRUE.
Figure 2: Clustering Algorithm
The algorithm then considers NP2. Dist(NP1,NP2) = 11.0 plus a small
penalty for their difference in position. If this distance is > r, they
will not be considered coreferent, and the resulting equivalence
classes will be: {{The chairman, he}, {Ms. White}}. Otherwise, the
distance is < r, and the algorithm considers c1 and c2 for merging.
However, c1 contains NP3, and, as calculated above, the distance from
NP2 to NP3 is ∞. This incompatibility prevents the merging of c1 and
c2, so the resulting equivalence classes would still be
{{The chairman, he}, {Ms. White}}.
In this way, the equivalence classes grow in a 
flexible manner. In particular, the clustering al- 
gorithm automatically computes the transitive clo- 
sure of the coreference relation. For instance, if 
dist(NPi,NPj) < r and dist(NPj,NPk) < r 
then (assuming no incompatible NPs), NPi, NPj, 
and NPk will be in the same class and considered
mutually coreferent. In fact, it is possible that
dist(NPi, NPk) > r, according to the distance measure;
but as long as that distance is not ∞, NPi
can be in the same class as NPk. The distance
measure operates on two noun phrases in isolation, 
but the clustering algorithm can and does make use 
of intervening NP information: intervening noun 
phrases can form a chain that links otherwise dis- 
tant NPs. By separating context-independent and 
context-dependent computations, the noun phrase 
representation and distance metric can remain fairly 
simple and easily extensible as additional knowledge 
sources are made available to the NLP system for 
coreference resolution. 
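The procedure of Figure 2 and the worked example can be replayed with a short Python sketch. It is a hedged illustration rather than the original implementation: the function names, the index-based class bookkeeping, and the nearest-first order in which preceding noun phrases are visited (suggested by the worked example, not stated in Figure 2) are assumptions, and the toy distance table hard-codes illustrative values instead of computing the metric of Section 3.2.

```python
import math


def all_nps_compatible(ci, cj, nps, dist):
    """False if any cross-class pair of NPs is at infinite distance."""
    return all(dist(nps[min(a, b)], nps[max(a, b)]) != math.inf
               for a in cj for b in ci)


def coreference_clustering(nps, dist, r):
    """Cluster NPs (given in document order) into coreference classes.

    dist(earlier_np, later_np) can be any pairwise distance function.
    Returns a sorted list of classes, each a sorted list of NP indices.
    """
    class_of = [{k} for k in range(len(nps))]       # singleton classes
    for j in range(len(nps) - 1, -1, -1):           # last NP first
        for i in range(j - 1, -1, -1):              # each preceding NP
            ci, cj = class_of[i], class_of[j]
            if ci is cj:                            # already merged
                continue
            d = dist(nps[i], nps[j])
            if d < r and all_nps_compatible(ci, cj, nps, dist):
                merged = ci | cj
                for k in merged:                    # all members share it
                    class_of[k] = merged
    unique = {id(c): c for c in class_of}
    return sorted(sorted(c) for c in unique.values())


# Toy distances for "The chairman spoke with Ms. White yesterday. He ..."
# (the position penalties 0.1 and 0.2 are made-up illustrative values).
nps = ["The chairman", "Ms. White", "He"]
table = {
    ("The chairman", "Ms. White"): 11.1,
    ("The chairman", "He"): 0.2,
    ("Ms. White", "He"): math.inf,
}


def toy_dist(a, b):
    return table[(a, b)]


print(coreference_clustering(nps, toy_dist, r=4))   # prints [[0, 2], [1]]
```

Raising r above 11.1 leaves the result unchanged in this toy setting, because the infinite distance between Ms. White and He blocks the merge of their classes, as in the example above.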
4 Evaluation 
We developed and evaluated the clustering approach to coreference
resolution using the "dry run" and "formal evaluation" MUC-6
coreference corpora. Each corpus contains 30 documents that have been
annotated with NP coreference links. We used the dryrun data for
development of the distance measure and selection of the clustering
radius r and reserved the formal evaluation materials for testing.
All results are reported using the standard measures of recall and
precision or F-measure (which combines recall and precision equally).
They were calculated automatically using the MUC-6 scoring
program (Vilain et al., 1995).
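For readers unfamiliar with the scoring scheme, the sketch below illustrates the link-based recall and precision of Vilain et al. (1995) as it is commonly stated: for each key class S, recall credits |S| minus the number of pieces into which the response splits S, out of the |S| - 1 links needed to connect S; precision swaps the roles of key and response. This is an illustration of the formula only, with hypothetical function names, and is not the MUC-6 scoring program itself.

```python
def _muc_ratio(reference, other):
    """Sum of (|S| - |p(S)|) over reference classes S, divided by the
    sum of (|S| - 1), where p(S) is S partitioned by the classes in
    `other` (NPs missing from `other` count as singleton pieces)."""
    numerator = denominator = 0
    for ref_class in reference:
        pieces = set()
        for np in ref_class:
            owner = next((k for k, c in enumerate(other) if np in c), None)
            pieces.add(("class", owner) if owner is not None
                       else ("single", np))
        numerator += len(ref_class) - len(pieces)
        denominator += len(ref_class) - 1
    return numerator / denominator if denominator else 0.0


def muc_recall(key, response):
    return _muc_ratio(key, response)


def muc_precision(key, response):
    return _muc_ratio(response, key)


# One key class of four NPs, split into two classes by the response.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
print(muc_recall(key, response), muc_precision(key, response))
```

In this example one of the three links needed to connect the key class is missing, so recall is 2/3, while every link the response proposes is licensed by the key, so precision is 1.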
Table 4 summarizes our results and compares 
them to three baselines. For each algorithm, we 
show the F-measure for the dryrun evaluation (col- 
umn 2) and the formal evaluation (column 4). (The 
"adjusted" results are described below.) For the 
dryrun data set, the clustering algorithm obtains 
48.8% recall and 57.4% precision. The formal eval- 
uation produces similar scores: 52.7% recall and 
54.6% precision. Both runs use r = 4, which was ob- 
tained by testing different values on the dryrun cor- 
pus. Table 5 summarizes the results on the dryrun 
data set for r values from 1.0 to 10.0 (footnote 3). As expected,
increasing r also increases recall, but decreases pre- 
cision. Subsequent tests with different values for r 
on the formal evaluation data set also obtained optimal
performance with r = 4. This provides partial
support for our hypothesis that r need not be recal- 
culated for new corpora. 
The remaining rows in Table 4 show the perfor- 
mance of the three baseline algorithms. The first 
baseline marks every pair of noun phrases as coref- 
erent, i.e. all noun phrases in the document form one 
class. This baseline is useful because it establishes 
an upper bound for recall on our clustering algo- 
rithm (67% for the dryrun and 69% for the formal 
evaluation). The second baseline marks as corefer- 
ent any two noun phrases that have a word in com- 
mon. The third baseline marks as coreferent any two 
noun phrases whose head nouns match. Although 
the baselines perform better than one might expect (they
outperform one MUC-6 system), the clustering al- 
gorithm performs significantly better. 
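The three baselines are easy to state as pairwise link generators. The following Python sketch is one possible reading, with assumed names and data layout: noun phrases are lists of tokens, the head noun is taken to be the last word as in Section 3.1, matching is case-insensitive, and each function returns the set of index pairs it marks as coreferent.

```python
from itertools import combinations


def all_one_class(nps):
    """Baseline 1: every pair of noun phrases is coreferent."""
    return set(combinations(range(len(nps)), 2))


def match_any_word(nps):
    """Baseline 2: coreferent if the two NPs share at least one word."""
    lowered = [{w.lower() for w in np} for np in nps]
    return {(i, j) for i, j in combinations(range(len(nps)), 2)
            if lowered[i] & lowered[j]}


def match_head_noun(nps):
    """Baseline 3: coreferent if the head nouns (last words) match."""
    heads = [np[-1].lower() for np in nps]
    return {(i, j) for i, j in combinations(range(len(nps)), 2)
            if heads[i] == heads[j]}


nps = [["the", "chairman"], ["the", "board"], ["Ms.", "White"], ["White"]]
print(sorted(match_any_word(nps)))    # prints [(0, 1), (2, 3)]
print(sorted(match_head_noun(nps)))   # prints [(2, 3)]
print(len(all_one_class(nps)))        # prints 6
```

The example shows why the word-overlap baseline tends to overgenerate: a shared article is enough to link the chairman and the board.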
In part because we rely on base noun phrases, our recall levels are
fairly low. The "adjusted" figures of Table 4 reflect this upper bound
on recall. Considering only coreference links between base noun
phrases, the clustering algorithm obtains a recall of 72.4% on the
dryrun, and 75.9% on the formal evaluation. Another source of error is
inaccurate and inadequate NP feature vectors. Our procedure for
computing semantic class values, for example, is responsible for many
errors: it sometimes returns incorrect values, and the coarse semantic
class distinctions are often inadequate. Without a better named entity
finder, computing feature vectors for proper nouns is difficult. Other
errors result from a lack of thematic and grammatical role
information. The lack of discourse-related topic and focus information
also limits system performance. In addition, we currently make no
special attempt to handle reflexive pronouns and pleonastic "it".
Footnote 3: Note that r need not be an integer, especially when the
distance metric is returning non-integral values.
                 Dryrun Data Set      Formal Run Data Set
Algorithm        Official  Adjusted   Official  Adjusted
Clustering       52.8      64.9       53.6      63.5
All One Class    44.8      50.2       41.5      45.7
Match Any Word   44.1      52.8       41.3      48.8
Match Head Noun  46.5      56.9       45.7      54.9
Table 4: F-measure Results for the Clustering Algorithm and Baseline Systems on the MUC-6 Data Sets
Lastly, errors arise from the greedy nature of the 
clustering algorithm. Noun phrase NPj is linked 
to every preceding noun phrase NPi that is compatible
and within the radius r, and that link can
never be undone. We are considering three possible 
ways to make the algorithm less aggressively greedy. 
First, for each NPj, instead of considering every 
previous noun phrase, the algorithm could stop on 
finding the first compatible antecedent. Second, for 
each NPj, the algorithm could rank all possible an- 
tecedents and then choose the best one and link only 
to that one. Lastly, the algorithm could rank all possible
coreference links (all pairs of noun phrases in
the document) and then proceed through them in 
ranked order, thus progressing from the links it is 
most confident about to those it is less certain of. 
Future work will include a more detailed error anal- 
ysis. 
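To make the second of these proposals concrete, the sketch below selects, for each noun phrase, only its single closest preceding candidate within the radius. It is purely illustrative of the idea described above, which the paper leaves as future work: the function name, the tie-breaking rule (prefer the nearer antecedent), and the omission of the class-level compatibility check are all assumptions of this sketch.

```python
import math


def best_antecedent_links(nps, dist, r):
    """For each NP j, link only to the preceding NP i with the smallest
    distance below r; ties go to the nearest preceding NP.
    Returns a dict mapping j to the chosen antecedent index i."""
    links = {}
    for j in range(len(nps)):
        candidates = [(dist(nps[i], nps[j]), -i) for i in range(j)]
        candidates = [(d, neg_i) for d, neg_i in candidates if d < r]
        if candidates:
            _, neg_i = min(candidates)
            links[j] = -neg_i
    return links


# Toy distances (illustrative values only).
nps = ["The chairman", "Ms. White", "He"]
table = {
    ("The chairman", "Ms. White"): 11.1,
    ("The chairman", "He"): 0.2,
    ("Ms. White", "He"): math.inf,
}
print(best_antecedent_links(nps, lambda a, b: table[(a, b)], r=4))
# prints {2: 0}
```

Each noun phrase receives at most one link, so a single poor match can no longer pull several classes together; whether this improves scores on the MUC-6 data is not something the paper reports.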
5 Related Work 
Existing systems for noun phrase coreference resolution can be broadly
characterized as learning and non-learning approaches. All previous
attempts to view coreference as a learning problem treat coreference
resolution as a classification task: the algorithms classify a pair of
noun phrases as coreferent or not. Both MLR (Aone and Bennett, 1995)
and RESOLVE (McCarthy and Lehnert, 1995), for example, apply the C4.5
decision tree induction algorithm (Quinlan, 1992) to the task. As
supervised learning algorithms, both systems require a fairly large
amount of training data that has been annotated with coreference
resolution information. Our approach, on the other hand, uses
unsupervised learning (footnote 4) and requires no training data
(footnote 5). In addition, both MLR and RESOLVE require an additional
mechanism to coordinate the collection of pairwise coreference
decisions. Without this mechanism, it is possible that the decision
tree classifies NPi and NPj as coreferent, and NPj and NPk as
coreferent, but NPi and NPk as not coreferent. In an evaluation on the
MUC-6 data set (see Table 6), RESOLVE achieves an F-measure of 47%.
r    Recall  Precision  F-measure
1    34.6    69.3       46.1
2    44.7    61.4       51.7
3    47.3    58.5       52.3
4    48.8    57.4       52.8
5    49.1    56.8       52.7
6    49.8    55.0       52.3
7    50.3    53.8       52.0
8    50.7    53.0       51.8
9    50.9    52.5       51.7
10   50.9    52.1       51.5
Table 5: Performance on the Dryrun Data Set for Different r
The MUC-6 evaluation also provided results for a 
large number of non-learning approaches to corefer- 
ence resolution. Table 6 provides a comparison of 
our results to the best and worst of these systems. 
Most implemented a series of linguistic constraints 
similar in spirit to those employed in our system. 
The main advantage of our approach is that all constraints and
preferences are represented neatly in the distance metric (and radius
r), allowing for simple modification of this measure to incorporate
new knowledge sources. In addition, we anticipate being able to
automatically learn the weights used in the distance metric.
Algorithm    Recall  Precision  F-measure
Clustering   53      55         54
RESOLVE      44      51         47
Best MUC-6   59      72         65
Worst MUC-6  36      44         40
Table 6: Results on the MUC-6 Formal Evaluation
Footnote 4: Whether or not clustering can be considered a "learning"
approach is unclear. The algorithm uses the existing partitions to
process each successive NP, but the partitions generated for a
document are not useful for processing subsequent documents.
Footnote 5: We do use training data to tune r, but as noted above, it
is likely that r need not be recalculated for new corpora.
There is also a growing body of work on the nar- 
rower task of pronoun resolution. Azzam et al. 
(1998), for example, describe a focus-based approach 
that incorporates discourse information when re- 
solving pronouns. Lappin and Leass (1994) make 
use of a series of filters to rule out impossible
antecedents, many of which are similar to our
∞-incompatibilities. They also make use of more extensive
syntactic information (such as the thematic role
each noun phrase plays), and thus require a fuller 
parse of the input text. Ge et al. (1998) present 
a supervised probabilistic algorithm that assumes a 
full parse of the input text. Dagan and Itai (1991) 
present a hybrid full-parse/unsupervised learning 
approach that focuses on resolving "it". Despite a 
large corpus (150 million words), their approach suf- 
fers from sparse data problems, but works well when 
enough relevant data is available. Lastly, Cardie 
(1992a; 1992b) presents a case-based learning ap- 
proach for relative pronoun disambiguation. 
Our clustering approach differs from this previous 
work in several ways. First, because we only require 
the noun phrases in any input text, we do not require 
a full syntactic parse. Although we would expect
increases in performance if complex noun phrases were
used, our restriction to base NPs does not reflect a 
limitation of the clustering algorithm (or the dis- 
tance metric), but rather a self-imposed limitation 
on the preprocessing requirements of the approach. 
Second, our approach is unsupervised and requires 
no annotation of training data, nor a large corpus for 
computing statistical occurrences. Finally, we han- 
dle a wide array of noun phrase coreference, beyond 
just pronoun resolution. 
6 Conclusions and Future Work 
We have presented a new approach to noun phrase 
coreference resolution that treats the problem as 
a clustering task. In an evaluation on the MUC- 
6 coreference resolution data set, the approach 
achieves very promising results, outperforming the 
only other corpus-based learning approach and pro- 
ducing recall and precision scores that place it firmly 
between the best and worst coreference systems in 
the evaluation. In contrast to other approaches to 
coreference resolution, ours is unsupervised and of- 
fers several potential advantages over existing meth- 
ods: no annotated training data is required, the dis- 
tance metric can be easily extended to account for 
additional linguistic information as it becomes avail- 
able to the NLP system, and the clustering approach 
provides a flexible mechanism for combining a vari- 
ety of constraints and preferences to impose a parti- 
tioning on the noun phrases in a text into coreference 
equivalence classes. 
Nevertheless, the approach can be improved in a 
number of ways. Additional analysis and evaluation 
on new corpora are required to determine the gen- 
erality of the approach. Our current distance met- 
ric and noun phrase instance representation are only 
first, and admittedly very coarse, approximations to 
those ultimately required for handling the wide va- 
riety of anaphoric expressions that comprise noun 
phrase coreference. We would also like to make use 
of cues from centering theory and plan to explore the 
possibility of learning the weights associated with 
each term in the distance metric. Our methods for 
producing the noun phrase feature vector are also 
overly simplistic. Nevertheless, the relatively strong 
performance of the technique indicates that cluster- 
ing constitutes a powerful and natural approach to 
noun phrase coreference resolution. 
7 Acknowledgments 
This work was supported in part by NSF Grant IRI- 
9624639 and a National Science Foundation Graduate 
fellowship. We would like to thank David Pierce for his 
formatting and technical advice. 

References 

Chinatsu Aone and William Bennett. 1995. Evaluating Automated and
Manual Acquisition of Anaphora Resolution Strategies. In Proceedings of
the 33rd Annual Meeting of the ACL, pages 122-129. Association for
Computational Linguistics.

S. Azzam, K. Humphreys, and R. Gaizauskas. 1998. Evaluating a
Focus-Based Approach to Anaphora Resolution. In Proceedings of the 36th
Annual Meeting of the ACL and COLING-98, pages 74-78. Association for
Computational Linguistics.

C. Cardie and D. Pierce. 1998. Error-Driven Pruning of Treebank
Grammars for Base Noun Phrase Identification. In Proceedings of the
36th Annual Meeting of the ACL and COLING-98, pages 218-224.
Association for Computational Linguistics.

C. Cardie. 1992a. Corpus-Based Acquisition of Relative Pronoun
Disambiguation Heuristics. In Proceedings of the 30th Annual Meeting of
the ACL, pages 216-223, University of Delaware, Newark, DE. Association
for Computational Linguistics.

C. Cardie. 1992b. Learning to Disambiguate Relative Pronouns. In
Proceedings of the Tenth National Conference on Artificial
Intelligence, pages 38-43, San Jose, CA. AAAI Press / MIT Press.

I. Dagan and A. Itai. 1991. A Statistical Filter for Resolving Pronoun
References. In Y. A. Feldman and A. Bruckstein, editors, Artificial
Intelligence and Computer Vision, pages 125-135. Elsevier Science
Publishers, North Holland.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. The MIT
Press, Cambridge, MA.

N. Ge, J. Hale, and E. Charniak. 1998. A Statistical Approach to
Anaphora Resolution. In Charniak, Eugene, editor, Proceedings of the
Sixth Workshop on Very Large Corpora, pages 161-170, Montreal, Canada.
ACL SIGDAT.

S. Lappin and H. Leass. 1994. An Algorithm for Pronominal Anaphora
Resolution. Computational Linguistics, 20(4):535-562.

M. Marcus, M. Marcinkiewicz, and B. Santorini. 1993. Building a Large
Annotated Corpus of English: The Penn Treebank. Computational
Linguistics, 19(2):313-330.

J. McCarthy and W. Lehnert. 1995. Using Decision Trees for Coreference
Resolution. In C. Mellish, editor, Proceedings of the Fourteenth
International Conference on Artificial Intelligence, pages 1050-1055.

1995. Proceedings of the Sixth Message Understanding Conference
(MUC-6). Morgan Kaufmann, San Francisco, CA.

J. R. Quinlan. 1992. C4.5: Programs for Machine Learning. Morgan
Kaufmann, San Mateo, CA.

M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995.
A model-theoretic coreference scoring scheme. In Proceedings of the
Sixth Message Understanding Conference (MUC-6), pages 45-52, San
Francisco, CA. Morgan Kaufmann.
