Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 40–47,
Sydney, July 2006. c©2006 Association for Computational Linguistics
A Clustering Aproach for 
Unsupervised Chinese Coreference Resolution 
Chi-shing Wang Grace NGAI 
Department of Computing 
Hong Kong Polytechnic University 
Kowloon, HONG KONG 
{cscswang, csgngai}@comp.polyu.edu.hk 
 
 
Abstract 
Coreference resolution is the process of 
identifying expressions that refer to the same 
entity. This paper presents a clustering algo-
rithm for unsupervised Chinese coreference 
resolution. We investigate why Chinese 
coreference is hard and demonstrate that 
techniques used in coreference resolution for 
English can be extended to Chinese. The 
proposed system exploits clustering as it has 
advantages over traditional classification 
methods, such as the fact that no training 
data is required and it is easily extended to 
accommodate additional features. We con-
duct a set of experiments to investigate how 
noun phrase identification and feature selec-
tion can contribute to coreference resolution 
performance. Our system is evaluated on an 
annotated version of the TDT3 corpus using 
the MUC-7 scorer, and obtains comparable 
performance. We believe that this is the first 
attempt at an unsupervised approach to Chi-
nese noun phrase coreference resolution. 
1 INTRODUCTION 
Noun phrase coreference resolution is the proc-
ess of detecting noun phrases (NPs) in a docu-
ment and determining whether the NPs refer to 
the same entity, where an entity is defined as “a 
construct that represents an abstract identity”. 
The NPs that refer to the entity are known as 
mentions. Mentions can be antecedents or ana-
phors. An anaphor is an expression that refers 
back to a previous expression in a discourse. In 
Figure 1, 4  m <U (President Clinton) refers 
to 4  m  (Clinton) and is described as an ana-
phoric reference to4  m  (Clinton). 4 m <
U  (President Clinton) is described as the antece-
dent of �  (he). 4 m  (Clinton), 4  m<U  
(President Clinton) and the second �  (he) are all 
mentions of the same entity that refers to former 
U.S. president Bil Clinton. 
NP coreference resolution is an important sub-
task in natural language processing (NLP) appli-
cations such as text summarization, information 
extraction, data mining and question answering. 
This task has attracted much attention in recent 
years (Cardie and Wagstaff, 199; Harabagiu et 
al., 201; Soon et al., 201; Ng and Cardie, 
202; Yang et al., 204; Florian et al., 204; 
Zhou et al., 205), and has been included as a 
subtask in the MUC (Message Understanding 
Conferences) and ACE (Automatic Content 
Extraction) competitions. 
Coreference resolution is a difficult task for 
various reasons. Firstly, a list of features can play 
a role to suport coreference resolution such as 
 
Figure 1: An excerpt from the text, with core-
ferring noun phrases annotated. English trans-
lation in italics. 
 <4  m   >�  d6 8 m 
Z � �� � �  <�
7   >Y
� �j?  f  <� } �   >�  <4  m   >Yz  �
[ J f <�   >� j  �  <4  m <U   >�6 �  
a  d  <�   >
: �� � 5  <�
7   >�m ���
� f � 	 
[Clinton
1
] said that Washington would progres-
sively folow through on economic aid to [Ko-
rea
2
]. [Kim Dae-Jung
3
] aplauded [Clinton
1
]’s 
speech. [He
1
] said, “[President Clinton
1
] reiter-
ated in the talks that [he
1
] would provide solid 
suport for [Korea
2
] to shake of the economic 
crisis. 
40
gender agreement, number agreement, head noun 
matches, semantic class, positional information, 
contextual information, appositive, abbreviation 
etc. Ng and Cardie (202) found 53 features 
which are useful for this problem. However, no 
single feature is completely reliable since there 
are always exceptions: e.g. the number agree-
ment test returns false when  x � B  (this 
army, singular) is matched against L~ B �  (army 
members, plural), despite the two phrases being 
coreferential. Secondly, identifying features 
automatically and accurately is hard. Features 
such as semantic class come from named entity 
recognition (NER) systems and ontologies and 
gazetteers, but they are not always accurate, es-
pecially where new terms are concerned. Thirdly, 
coreference resolution subsumes the pronoun 
resolution problem, which is already difficult 
since pronouns carry limited lexical and semantic 
information. 
In addition to the aforementioned, Chinese 
coreference resolution is also made more diffi-
cult due to the lack of morphological and ortho-
graphic clues. Chinese words contain less exte-
rior information than words in many Indoeuro-
pean languages. For example, in English, number 
agreement can be detected through word inflec-
tions and part-of-speech (POS) tags, but there are 
no simple rules in Chinese to distinguish whether 
a word is singular or plural. Proper name and 
abbreviations are identified by capitalization in 
English, but Chinese does not use capitalization. 
Moreover, writen Chinese does not have word 
boundaries, so word segmentation is a crucial 
problem, as we cannot get the true meaning of 
the sentence based on characters alone. A simple 
sentence can be segmented in several different 
ways to get different meanings. This characteris-
tic affects the performance of all parts and leads 
to irrecoverable errors. In addition, there are very 
few Chinese coreference data sets available for 
research purposes (none of them freely available) 
and as a result, no easily obtainable benchmark-
ing dataset for training and measuring perform-
ance. Building a reasonably large coreference 
corpus is a labor-consuming task. 
To our knowledge, there have only been two 
Chinese coreference systems in previously pub-
lished work: Florian et al. (204), which presents 
a statistical framework and reports experiment 
results on Chinese texts; and Zhou et al. (205), 
which proposed a unified transformation based 
learning framework for Chinese entity detection 
and tracking. It consists of two models: the de-
tection model locates posibly coreferring NPs 
and the tracking model links the coreference re-
lations. 
This paper presents research performed on 
Chinese noun phrase coreference resolution. 
Since there are no freely available Chinese 
coreference resources, we used an unsupervised 
method that partially borrows from Cardie and 
Wagstaff’s (199) clustering-based technique, 
with features that are specially designed for Chi-
nese. In addition, we perform and present the 
results of experiments designed to investigate the 
contribution of each feature. 
2 Experiment Setup 
Identifying coreferent NPs in an unanotated 
document actually involves two tasks: mention 
detection, which identifies the anaphors and an-
tecedents in a document, folowed by noun 
phrase coreference resolution. In order to reduce 
the complexity of the final system, we folow the 
usual aproach in handling these two phases 
separately. 
2.1 Corpus 
Even though we are using an unsupervised ap-
proach, a gold standard corpus is stil needed for 
experiment evaluation. Since we did not have 
access to the ACE multilingual entity tracking 
corpus, we created our own corpus by selecting 
30 documents from the TDT3 Chinese corpus. 
This resulted in a corpus of approximately 36K 
Chinese characters, about the same size as the 
MUC dryrun test sets. We then had our corpus 
annotated by a native Chinese speaker folowing 
the MUC-7 (Hirschman and Chinchor, 197) and 
ACE Chinese entity guidelines (LDC, 204) by 
picking out noun phrase mentions corresponding 
to one of the folowing nine types of entities: 
Person, Organization, Location, Geo-Political 
Entity (GPE), Facility, Vehicle, Weapon, Date 
and Money, and for each pair of mentions, decid-
ing whether they refer to the same entity folow-
ing MUC-7 definitions. According to the guide-
lines, each mention participates in exactly one 
entity, and all mentions in the same entity are 
coreferent. The NPs that are marked include 
proper nouns, nominal nouns and pronouns and 
the entity types are a superset of those used in the 
MUC and ACE competitions. The resulting cor-
pus includes 1640 mentions, referring to 410 en-
tities. 
Once our corpus had been determined, the first 
step was to determine the posible mentions in a 
plain text. We first used a dictionary-based word 
41
segmentation system (Lancashire, 205) to seg-
ment the Chinese characters into words. The 
segmented words are then labeled with POS tags 
by a statistical POS tagging system (Fung et al., 
204).  
3 Mention Detection 
After the corpus has been preprocessed, mention 
detection involves the identification of NPs in 
the corpus that refer to some entity. Most of 
these NPs correspond to non-recursive NPs, 
which makes this task simpler as most syntactic 
parsers identify NPs as part of the parsing proc-
ess. This approach, however, suffers from two 
problems: firstly, the parser itself is unlikely to 
be 10% accurate; and secondly, the boundaries 
of the NPs identified by the parser may not cor-
respond exactly with those of the entities identi-
fied by the human annotator. 
Another approach is simply to use heuristics 
based on the POS tag sequence to identify poten-
tial NPs of interest. The advantage of this 
method is that the NPs thus extracted should be 
closer to the human-annotated entities since the 
heuristics wil be constructed specifically for this 
task. 
To investigate the effect of different ap-
proaches on the result of the coreference resolu-
tion, we applied both methods separately to our 
corpus. The corpus was parsed with a state-of-
the-art multilingual statistical parser (Bikel 
204), which is trained on the Chinese Penn 
Treebank. After parsing, we extracted all non-
recursive NP chunks tagged by the parser as pos-
sible mentions. 
 For the heuristic-based approach, we aplied 
a few simple heuristics, which had been previ-
ously developed during unrelated work for Eng-
lish named-entity resolution (i.e. they were not 
writen with foreknowledge of the gold standard 
entities) and which are based on the part-of-
speech tags of the words. Some examples of our 
heuristics were to lok for pronouns, or to extract 
all noun sequences, or sequences of determiners 
folowed by adjectives and nouns. 
 Table 1 shows the performance of the pars-
ing-based approach versus the heuristic-based 
aproach. The parser-based approach suffers 
mainly because the NPs that it extracts tend to be 
on the long side, resulting in recall errors when 
the boundaries of the parser-identified NPs mis-
match with the human-annotated entities. In ad-
dition, the parser also tends to extract more NPs 
than needed, which results in a hit to precision. 
4 Coreference Resolution 
The final step after the mention detection phase 
is to determine which of the extracted phrases 
refer to the same entity, or are coreferent. 
The small size of our corpus made it quite ob-
vious that we would not be able to perform su-
pervised learning, as there would not be enough 
data for generalization purposes.  Therefore we 
chose to use an unsupervised clustering approach 
for this step. Clustering is a natural choice as it 
partitions the data into groups; used on corefer-
ence resolution, we expect to gather coreferrent 
NPs into the same cluster.  Furthermore, most 
clustering methods can easily incorporate both 
context-dependent and independent constraints 
into their features. 
4.1 Features 
Our features use both lexical and syntactic in-
formation designed to capture both the content of 
the phrase and its role within the sentence. With 
the exception of the last three features, which are 
defined with respect to a noun phrase pair, all our 
features describe various aspects of a single noun 
phrase: 
Lexical String – This is just simply the string of 
words in the phrase. 
Head Noun – The head noun in a phrase is the 
noun that is not a modifier for another noun. 
Sentence Position – This measures the position 
of the phrase within the document. 
Gender – For each phrase, we use a gazetteer to 
asign it a gender. The posible values are male 
(e.g. 
� 
\ , mister), female (e.g. 	� } , mis), ei-
ther (e.g. v � , leader) and neither (e.g. � 	� , 
factory). 
Number – A phrase can be either singular (e.g. 
� 
mI , one cat), plural (e.g. w 
m� , two dogs), 
either (e.g. 5� � , product) or neither (e.g. � 
� , 
safety). 
 Recall Precision F-Measure 
Heuristics 83 59.3 69.2 
Parser-Based 62.7 28.7 39.4 
Table 1: Mention Detection Results 
 
42
Semantic Clas – To give the system more in-
formation on each phrase, we generated our own 
gazetteer from a combination of gazetteers com-
piled from web sources and heuristics. Our gaz-
etteer consists of 470 entries, each of which is 
labeled with one of the folowing semantic 
classes: person, organization, location, facility, 
GPE, date, money, vehicle and weapon. Phrases 
in the corpus that are found in the gazetteer are 
given the same semantic class label; phrases not 
in the gazetteer are marked as UNKNOWN. 
Proper Name – The part-of-speech tag “NR” 
and a list of common proper names were used to 
label each noun phrase as to whether it is a 
proper name (values: true/false). 
Pronoun – Determined by the part-of-speech 
“PN”. Values: true/false. 
Demonstrative Noun Phrase – A demonstrative 
noun phrase is a phrase that consists of a noun 
phrases preceded by one of the characters [ �
� ] (this/that/some). 
Appositive – Two noun phrases are in apposition 
when the first phrase is headed by a common 
noun while the second one is a proper name with 
no space or punctuation between them. e.g. [�
s � w ][  � � ]	� 
U 8 x � 
� Z � { ([US 
president] [Clinton] visited Pyongyang last 
week.) This differs from English where two 
nouns are considered to be in apposition when 
one of them is an anaphor and separated by a 
comma from the other phrase, which is the most 
immediate proper name. (e.g. “Bil Gates, the 
chairman of Microsoft Corp”) 
Abbreviation – A noun phrase is an abbrevia-
tion when it is formed by using part of another 
noun phrase, e.g. � 
� �  �
 �  (Pyongyang 
Central Comunications Ofice) is commonly 
abbreviated as � � � . Since name abbreviations 
in Chinese are often given in an ad-hoc manner, 
it would be infeasible to generate a list of nam 
and abbreviations in advance. We therefore use 
the folowing heuristic: given two phrases, we 
test if one is an abbreviation of another by ex-
tracting each successive character from the 
shorter phrase and testing to see if it is included 
in the corresponding word from the longer 
phrase. Intuitively, we know that this is a com-
mon way of abbreviating terms; empirically, it 
usually gives us a correct result. 
Edit Distance – Abreviations and nicknames 
Feature f Function 
Noun Phrase Match -1 if the string of NP
i
 matches the string of NP
j
; else 0 
Head Noun Match -1 if head noun of NP
i
 matches the head noun of NP
j
; else 0 
Sentence Distance 
0 if NP
i
 and NP
j
 are in the same sentence; 
For non-pronouns: 1/10 if they are one sentence apart; and so 
on with maximum value 1; 
For pronouns: if more than two sentences apart, then 1 
Gender Agreement 1 if they do not match in gender; else 0 
Number Agreement 1 if they do not match in number; else 0 
Semantic Agreement 1 if they do not match in semantic class or unknown; else 0 
Proper Name Agreement 
1 if both are proper names, but mismatch on every word; else 
0 
Pronoun Agreement 
1 if either NP
i
 or NP
j
 is pronoun and mismatch in gender or 
number; else 0 
Demonstrative Noun Phrase -1 if NP
i
 is demonstrative and NP
i
 contains NP
j
; else 0 
Apositive -1 if NP
i
 and NP
j
 are in an appositive relationship; else 0 
Abreviation -1 if NP
i
 and NP
j
 are in an abbreviative relationship; else 0 
Edit Distance 
0 if NP
i
 and NP
j
 are the same, 1/(length of longer string) if 
one edit is needed to transform one to another, and so on. 
Table 2: Features and functions used in clustering algorithm 
 
43
are very commonly used in Chinese and even 
though the previous feature wil work on most of 
them, there are some common exceptions. To 
make sure that we catch those as well, we intro-
duced a Chinese-specific feature as a further test. 
Since abbreviations and nicknames are not usu-
ally substrings of the original strings, but wil 
stil share some common characters, we measure 
the Levenshtein distance, defined as the number 
of character insertions, deletions and substitu-
tions, between every potential antecedent-
anaphor pair. 
4.2 Distance Metric 
In order for the clustering algorithm to be able to 
group instances together by similarity, we need 
to determine a distance metric between two in-
stances – in our case, two noun phrases. For our 
system, we borrowed a simple distance metric 
from Cardie and Wagstaff (199) that sums up 
the results of a series of functions over the two 
phrases: 
(,)(,)
ij fij
fF
dstNPunctioNP
!
=
"
 
Table 2 presents the features and the correspond-
ing functions that were used in our system. Each 
function calculates a distance between the two 
phrases that is an indicator f the degree of in-
compatibility between the two phrases with re-
spect to a particular feature. The NOUN 
PHRASE, HEAD NOUN, DEMONSTRATIVE, 
APPOSITIVE and ABREVIATIVE functions 
test for compatibility and return a negative value 
when the two phrases are compatible for that 
term’s feature. The reason for the negative value 
returned is that if the two phrases match on this 
particular feature, then it is a strong indicator of 
coreference. Therefore, we reduce the distance 
between two phrases, making it more likely that 
they wil be clustered together into the same en-
tity. When there is a mismatch, however, it does 
not necessarily indicate that the two NPs are non-
coreferential, so we leave the distance between 
the NPs unchanged. 
Conversely, there are some features where a 
mismatch would indicate that the two NPs are 
absolutely non-compatible and wil definitely not 
refer to the same entity.  The DISTANCE, 
GENDER, NUMBER, SEMANTIC, PROPER 
NAME, PRONOUN and EDIT_DISTANCE 
functions return a positive value when the two 
phrases mismatch on that particular feature. A 
positive value results in a greater distance be-
tween two phrases, which makes it less likely for 
them to be clustered together. 
4.3 Clustering Algorithm 
Most of the previous work in clustering-based 
noun phrase coreference resolution has centered 
around the use of botom-up clustering methods, 
where each noun phrase is initially asigned to a 
singleton cluster by itself, and clusters which are 
“close enough” to each other are merged (Cardie 
& Wagstaff, 199; Angheluta et al., 204). 
In our system, we use a method called modi-
fied k-means clustering (Wilpon & Rabiner 
1985), which takes the oposite approach and 
uses a top-down approach to split clusters, inter-
leaved with a k-means iterative phase. Modified 
k-means clustering has been sucessfuly applied 
to speech recognition and it has the advantage of 
always being able to come to the optimal cluster-
ing (i.e. it is not dependent upon the starting state 
or merging order).  
Modified k-means starts off with all the in-
stances in one big cluster. The system then itera-
tively performs the folowing steps: 
1. For each cluster, find its centroid, de-
fined as the instance which is the closest 
to all other instances in the same cluster. 
2. For each instance: 
a. Calculate its distance to all the 
centroids.  
b. Find the centroid with the mini-
mum distance, and join its clus-
ter. 
3. Iterate 1-2 until instances stop moving 
between clusters. 
4. Find the cluster with the largest intra-
cluster distance. (Call this Cluster
max
 and 
its centroid, Centroid
max
.) If this distance 
is smaller than some threshold r, stop. 
5. From the instances inside Cluster
max
, find 
the pair that are the furthest apart from 
each other. 
a. Ad the pair of instances to the 
list of centroids and remove 
Centroid
max
 from the list. 
b. Repeat from Step 2. 
The algorithm thus alternates traditional k-means 
clustering with a step that adds new clusters to 
the pol of existing ones.  Used for coreference 
resolution, it splits up the instances into clusters 
in which the instances are more similar to each 
other than to instances in other clusters. 
The only thing left to do is to determine a suit-
able threshold. As functions that check for com-
patibility return negative values while positive 
distances indicate incompatibility, a threshold of 
0 would separate compatible and incompatible 
44
elements.  However, since the feature extraction 
wil not be totally accurate, (especially for the 
GENDER and NUMBER features which test for 
incompatibility) we chose to be more lenient 
with deciding whether two phrases should be 
clustered together, and used a threshold of r = 1 
to allow for posible errors. 
5 Evaluation 
Evaluation of coreference resolution systems has 
traditionally been performed with precision and 
recall.  The MUC competition defines recall as 
folows (Vilain et al., 195): 
(||()|
1
ii
Cp
R
!
=
"
 
Each C
i
 is a gold standard cluster (i.e. a set of 
phrases which we know refer to the same entity), 
and p(C
i
) is the partitioning of C
i
 by the auto-
matically-generated clusters. For precision, the 
role of the automatic and gold standard clusters 
are reversed. Our results were evaluated using 
the MUC scoring program which reports recall, 
precision and F-measure, where the F-measure is 
defined as the harmonic mean of precision and 
recall: 
! 
F=
2PR
P+R
 
Table 3 presents the results of our coreference 
resolution system on the outputs of both the pars-
ing-based and heuristic-based entity detection 
systems, as measured by the MUC-7 scoring 
program. For the purposes of comparison, we 
also present results of our clustering algorithm 
on the gold standard entities.  This gives us a 
sense of the uper bound that we could poten-
tially achieve if we got 10% accuracy on our 
mention detection phase. An additional baseline 
is generated by implementing a system that as-
sumes that all phrases refer to the same entity – 
i.e. it takes all the heuristically-generated phrases 
and puts them into one big cluster. This gives us 
an uper bound on the recall of the system. Yet 
another baseline, to see how easy the task is, is to 
merge mentions together if the “Noun Phrase 
Match” function tests true.  
From the results, it can be seen that our system 
achieves a performance gain of over 10 F-
Measure points over the simplest baseline, and 
over 8 F-Measure points over the more sophisti-
cated baseline. Unfortunately, due to corpus dif-
ferences, we cannot conduct a comparison with 
results found in previous work. 
An interesting observation is the fact that the 
heuristic-based entity recognizer achieves better 
performance than the one based on statistical 
parsing.  The parser is trained on the Chinese 
Penn Treebank, which tends to have relatively 
longer noun phrases, and as result, the phrases 
generated by the parser also tend to be on the 
long side. This causes errors at the entity recog-
nition phase, which results in a performance hit 
for the overall system. 
6 Analysis 
One interesting question to ask about the results 
is the contribution of any given individual fea-
ture to the result of the overall system. We have 
already investigated the effect of entity recogni-
tion, and in this section, we take a lok at the 
features for the clustering algorithm. Error! 
Reference source not found. presents the results 
of a series of experiments in which one feature at 
a time was removed from the clustering algo-
rithm. The last entry in the table shows the re-
sults of the ful system; the drop in performance 
when a feature is removed is indicative of its 
contribution. Judging from the results, the 3 fea-
tures that contribute the most to performance are 
the NOUN PHRASE MATCH, SEMANTIC 
AGREMENT and EDIT DISTANCE features. 
Two out of the three, NOUN PHRASE and 
EDIT DISTANCE, operate on lexical informa-
tion.  The importance of string matching to 
coreference resolution is consistent with findings 
in previous work (Yang et al. 204), which ar-
rived at the same conclusion for English. 
In addition, we note that the two Chinese-
specific features that were introduced, ABBRE-
VIATION and EDIT DISTANCE, both contrib-
ute significantly (as measured by a student’s t-
test) to the performance of the final system. 
 Recall Precision F-Measure 
Gold Standard Entities 78 8.5 82.9 
Baseline (Heuristic-based Entities) 80.9 4.1 57.1 
Baseline (Noun Phrase Match Only) 50.9 7.2 61.3 
Heuristic-Based Entity Recognition 62.9 7.1 69.3 
Parsing-Based Entity Recognition 42.5 62.9 50.7 
Table 3: Coreference Resolution Performance 
 
45
Of our features, those that contribute the least 
to the overall performance are the GENDER, 
NUMBER and DEMONSTRATIVE NOUN 
PHRASE features. For DEMONSTRATIVE 
NOUN PHRASE, the reason is because of data 
sparsity – there are just simply not enough ex-
amples that it would make any significant im-
pact. For the GENDER and NUMBER features, 
we find that the problem is mostly with errors in 
feature generation. 
To our knowledge, this is the first published 
result on unsupervised Chinese coreference reso-
lution. Due to differences in data, it is not posi-
ble to conduct a comparison of our work with 
previous results. 
7 Related Work 
Coreference resolution has attracted much atten-
tion in recent years, especially as a result of the 
MUC and ACE competitions. The approaches 
taken have exhibited a shift from knowledge-
based approaches to learning-based approaches. 
Many of the learning-based approaches recast 
coreference resolution as a binary classification 
task, which, given a pair of NPs, uses a trained 
classifier to determine whether they are corefer-
ent. Soon et al. (201) used this aproach with a 
12-feature decision tree-based classifier and Ng 
and Cardie (202) extended this approach with 
extra machine learning frameworks and a larger 
set of features. Yang et al. (204) extended this 
approach into an NP-cluster based approach, 
which considers the relationships between 
phrases and coreferential clusters. 
In addition, several unsupervised approaches 
have been proposed. Cardie and Wagstaff (199) 
re-cast the problem as a clustering task which 
applied a set of incompatibility functions and 
weights in the distance metric. Bean and Riloff 
(204) used information extraction patterns to 
identify contextual clues that would determine 
the compatibility between NPs. 
Al of the previously mentioned work has been 
for English. There has been relatively litle work 
in Chinese: Florian et al. (204) provides results 
using a language-independent framework on the 
Entity Detection and Tracking task (EDT). They 
formulate the detection subtask as a classification 
problem using a Robust Risk Minimization clas-
sifier combined with a Maximum Entropy classi-
fier. Their system performs significantly well on 
English, Chinese and Arabic, however, the sys-
tem suffers from small amount of training data 
(90K characters for Chinese, in contrast with 
340K words for English). Their system obtained 
an ACE value of 58.8 on the ACE evaluation 
data on Chinese. Finally, Zhou et al. (205) pro-
posed a unified Transformation-Based Learning 
framework on Chinese EDT. The TBL tracking 
model loks at pairs of NPs at a time and classi-
fies them as being coreferent or not based on the 
values of six features. They report an ACE score 
of 63.3 on their dataset. 
8 Conclusions and Future Work 
In this paper, we have presented an unsupervised 
approach to Chinese coreference resolution. Our 
approach performs resolution by clustering, with 
the advantage that no annotated training data is 
needed. We evaluated our approach using a cor-
pus which we developed using standard annota-
tion schemes, and find that our system achieves 
an error reduction rate of almost 30% over the 
baseline. We also analyze the performance of our 
system by investigating the contribution of indi-
vidual features to our system. The analysis ilus-
Removed feature Recal Precision F-measure 
Noun Phrase Match 59.8 75.9 6.9 
Head Noun atch 60.4 76.2 67.4 
Sentence Distance 63.2 73.3 67.8 
Gender Agreement 62.9 76.3 68.9 
Number Agreement 63.2 75.9 69 
Semantic Agreement 60.5 73 6.2 
Proper Name Agreement 63 76.2 69 
Pronoun Agreement 61.3 76.9 68.2 
Demonstrative Noun Phrase 62.2 7.9 69.2 
Apositive 60.1 76.9 67.5 
Abreviation 61.6 7 68.4 
Edit Distance 62.4 72.8 67.2 
None (Al Features) 62.9 7.1 69.3 
Table 4: Contribution of individual features to overall performance. 
 
46
trates the contribution of the new language-
specific features. 
While the results produced by our system are 
impressive, it should be noted that all our fea-
tures consider only the mention phrase itself. We 
consider this to be a rather simplistic and incom-
plete. In future work, we plan to investigate the 
use of more sophisticated features, including 
contextual features, to improve the performance 
of our system. 

References 
ANGHELUTA R., JEUNIAUX P., RUDRADEB M., 
MOENS M.F. 204. Clustering Algorithms for 
Noun Phrase Coreference Resolution. Procedings 
of the 7th International Conference on the Statisti-
cal Analysis of Textual Data. 
BEAN, D. and RILOF, E. 204. Unsupervised learn-
ing of contextual role knowledge for coreference 
resolution. In Proc. of HLT/NACL, pages 297–
304. 
BIKEL, D. M. 204. A Distributional Analysis of a 
Lexicalized Statistical Parsing Model. In Proced-
ings of EMNLP, Barcelona 
CARDIE, C. and WAGSTAF, K. 199. Noun phrase 
coreference as clustering. In Procedings of the 
199 Joint SIGDAT Conference on Empirical 
Methods in Natural Language Procesing and Very 
Large Corpora, pages 82-89. 
FLORIAN, R., HASAN, H., ITYCHERIAH, A., 
JING, H., KAMBHATLA, N., LUO, X., 
NICOLOV, N., and ROUKOS, S. 204. Statistical 
Model for Multilingual Entity Detection and 
Tracking. In Procedings of 204 anual meting 
of the North American Chapter of the Asociation 
for Computational Linguistics (HLT-NACL 
204). 
FUNG, P., NGAI, G., YANG, Y. S., and CHEN, B.F. 
204. A maximum-entropy Chinese parser aug-
mented by Transformation-Based Learning. ACM 
Transactions on Asian Language Information 
Procesing (TALIP), 3(2), p 159-168. 
GAO J.F., LI M. and HUANG C.N. 203. Improved 
souce-chanel model for Chinese wordsegmenta-
tion. In Proc. of ACL203. 
HARABAGIU, S., BUNESCU, R.,and MAIORANO, 
S. 201. Text and Knowledge Mining for Corefer-
ence Resolution, in Procedings of the 2nd Met-
ing of the North American Chapter of the Asocia-
tion of Computational Linguistics (NACL-201). 
HIRSCHMAN, L. and CHINCHOR, N. 197. MUC7 
Coreference Task Definition, 
htp:/ww.itl.nist.gov/iaui/894.02/related_projects
/muc/procedings/co_task.html. 
LANCASHIRE, D. 205. Adsotrans Chinese-English 
anotation. htp:/ww.adsotrans.com/. 
LDC. 204. Chinese Anotation Guidelines for Entity 
Detection and Tracking. 
htp:/ww.ldc.upen.edu/Projects/ACE/Data. 
MUC-7. 198. Procedings of the Seventh Mesage 
Understanding Conference (MUC-7). organ 
Kaufman, San Francisco, CA. 
NG V. 205. Machine learning for coreference 
resolution: From local clasification to global 
ranking. In Procedings of the 43rd Anual Met-
ing of the Asociation for Computational Linguis-
tics (ACL), 205. 
NG, V. and CARDIE, C. 202. Improving machine 
learning aproaches to coreference resolution. In 
Procedings of the 40rd Anual Meting of the As-
sociation for Computational Linguistics, Pages 
104-11. 
NG V. and CARDIE C. 203. Botstraping Corefer-
ence Clasifiers with Multiple Machine Learning 
Algorithms. Procedings of the 203 Conference 
on Empirical Methods in Natural Language Proc-
esing (EMNLP-203), Asociation for Computa-
tional Linguistics, 203. 
SON, W., NG, H., and LIM, D. 201. A machine 
learning aproach to coreference resolution of 
noun phrases. Computational Linguistics, 
27(4):521-54. 
VILAIN, M., BURGER, J., ABERDEN, J., CON-
NOLY, D., and HIRSCHMAN, L. 195. A 
model-theoretic coreference scoring scheme. In 
Procedings of the Sixth Mesage Understanding 
Conference (MUC-6), pages 45–52, San Francisco, 
CA. Morgan Kaufman. 
WILPON, J., AND RABINER, L. 1985. A modified 
K-means clustering algorithm for use in isolated 
word recognition. In IEE Transactions on Acous-
tics, Spech, Signal Procesing. ASP-3(3), 587-
594. 
YANG, X., ZHOU, G., SU, J., and TAN, C. L. 204. 
An NP-Cluster Based Aproach to Coreference 
Resolution. Procedings of the 20th International 
Conference on Computational Linguistics (COL-
ING204). 
ZHOU Y., HUANG C., GAO J., WU L. 205. Trans-
formation Based Chinese Entity Detection and 
Tracking. Procedings of the Second International 
Joint Conference on Natural Language Procesing. 
