Evaluating Automated and Manual Acquisition of 
Anaphora Resolution Strategies 
Chinatsu Aone and Scott William Bennett 
Systems Research and Applications Corporation (SRA) 
2000 15th Street North 
Arlington, VA 22201 
aonec~sra.corn, bennett~sra.com 
Abstract 
We describe one approach to build an au- 
tomatically trainable anaphora resolution 
system. In this approach, we use Japanese 
newspaper articles tagged with discourse 
information as training examples for a ma- 
chine learning algorithm which employs 
the C4.5 decision tree algorithm by Quin- 
lan (Quinlan, 1993). Then, we evaluate 
and compare the results of several variants 
of the machine learning-based approach 
with those of our existing anaphora resolu- 
tion system which uses manually-designed 
knowledge sources. Finally, we compare 
our algorithms with existing theories of 
anaphora, in particular, Japanese zero pro- 
nouns. 
1 Introduction 
Anaphora resolution is an important but still diffi- 
cult problem for various large-scale natural language 
processing (NLP) applications, such as information 
extraction and machine tr~slation. Thus far, no 
theories of anaphora have been tested on an empir- 
ical basis, and therefore there is no answer to the 
"best" anaphora resolution algorithm. I Moreover, 
an anaphora resolution system within an NLP sys- 
tem for real applications must handle: 
• degraded or missing input (no NLP system 
has complete lexicons, grammars, or semantic 
knowledge and outputs perfect results), and 
• different anaphoric phenomena in different do- 
mains, languages, and applications. 
Thus, even if there exists a perfect theory, it might 
not work well with noisy input, or it would not cover 
all the anaphoric phenomena. 
1Walker (Walker, 1989) compares Brennan, Friedman 
a~ad Pollard's centering approach (Brennan et al., 1987) 
with Hobbs' algorithm (Hohbs, 1976) on a theoretical 
basis. 
These requirements have motivated us to de- 
velop robust, extensible, and trainable anaphora 
resolution systems. Previously (Aone and Mc- 
Kee, 1993), we reported our data-driven multilin- 
gual anaphora resolution system, which is robust, 
exteusible, and manually trainable. It uses dis- 
course knowledge sources (KS's) which are manu- 
ally selected and ordered. (Henceforth, we call the 
system the Manually-Designed Resolver, or MDR.) 
We wanted to develop, however, truly automatically 
trainable systems, hoping to improve resolution per- 
formance and reduce the overhead of manually con- 
structing and arranging such discourse data. 
In this paper, we first describe one approach 
we are taking to build an automatically trainable 
anaphora resolution system. In this approach, we 
tag corpora with discourse information, and use 
them as training examples for a machine learning 
algorithm. (Henceforth, we call the system the Ma- 
chine Learning-based Resolver, or MLR.) Specifi- 
cally, we have tagged Japanese newspaper articles 
about joint ventures and used the C4.5 decision tree 
algorithm by Quinlan (Quinlan, 1993). Then, we 
evaluate and compare the results of the MLR with 
those produced by the MDR. Finally, we compare 
our algorithms with existing theories of anaphora, 
in particular, Japanese zero pronouns. 
2 Applying a Machine Learning 
Technique to Anaphora Resolution 
In this section, we first discuss corpora which we 
created for training and testing. Then, we describe 
the learning approach chosen, and discuss training 
features and training methods that we employed for 
our current experiments. 
2.1 Training and Test Corpora 
In order to both train and evaluate an anaphora 
resolution system, we have been developing cor- 
pora which are tagged with discourse information. 
The tagging has been done using a GUI-based tool 
called the Discourse Tagging Tool (DTTool) ac- 
cording to "The Discourse Tagging Guidelines" we 
122 
have developed. 2 The tool allows a user to link an 
anaphor with its antecedent and specify the type 
of the anaphor (e.g. pronouns, definite NP's, etc.). 
The tagged result can be written out to an SGML- 
marked file, as shown in Figure 1. 
For our experiments, we have used a discourse- 
tagged corpus which consists of Japanese newspaper 
articles about joint ventures. The tool lets a user de- 
fine types of anaphora as necessary. The anaphoric 
types used to tag this corpus are shown in Table 1. 
NAME anaphora are tagged when proper names 
are used anaphorically. For example, in Figure 1, 
"Yamaichi (ID=3)" and "Sony-Prudential (ID=5)" 
referring back to "Yamaichi Shouken (ID=4)" (Ya- 
maichi Securities) and "Sony-Prudential Seimeiho- 
ken (ID=6)" (Sony-Prudential Life Insurance) re- 
spectively are NAME anaphora. NAME anaphora 
in Japanese are different from those in English in 
that any combination of characters in an antecedent 
can be NAME anaphora as long as the character or- 
der is preserved (e.g. "abe" can be an anaphor of 
"abcde"). 
Japanese definite NPs (i.e. DNP anaphora) are 
those prefixed by "dou" (literally meaning "the 
same"), "ryou" (literally meaning "the two"), and 
deictic determiners like "kono"(this) and "sono" 
(that). For example, "dou-sha" is equivalent to "the 
company", and "ryou-koku" to "the two countries". 
The DNP anaphora with "dou" and "ryou" pre- 
fixes are characteristic of written, but not spoken, 
Japanese texts. 
Unlike English, Japanese has so-called zero pro- 
nouns, which are not explicit in the text. In these 
cases, the DTTool lets the user insert a "Z" marker 
just before the main predicate of the zero pronoun to 
indicate the existence of the anaphor. We made dis- 
tinction between QZPRO and ZPRO when tagging 
zero pronouns. QZPRO ("quasi-zero pronoun") is 
chosen when a sentence has multiple clauses (sub- 
ordinate or coordinate), and the zero pronouns in 
these clauses refer back to the subject of the initial 
clause in the same sentence, as shown in Figure 2. 
The anaphoric types are sub-divided according to 
more semantic criteria such as organizations, people, 
locations, etc. This is because the current appli- 
cation of our multilingual NLP system is informa- 
tion extraction (Aone et al., 1993), i.e. extracting 
from texts information about which organizations 
are forming joint ventures with whom. Thus, resolv- 
ing certain anaphora (e.g. various ways to refer back 
to organizations) affects the task performance more 
than others, as we previously reported (Aone, 1994). 
Our goal is to customize and evaluate anaphora res- 
olution systems according to the types of anaphora 
when necessary. 
2Our work on the DTTool and tagged corpora was 
reported in a recent paper (Aone and Bennett, 1994). 
2.2 Learning Method 
While several inductive learning approaches could 
have been taken for construction of the trainable 
anaphoric resolution system, we found it useful to 
be able to observe the resulting classifier in the form 
of a decision tree. The tree and the features used 
could most easily be compared to existing theories. 
Therefore, our initial approach has been to employ 
Quinlan's C4.5 algorithm at the heart of our clas- 
sification approach. We discuss the features used 
for learning below and go on to discuss the training 
methods and how the resulting tree is used in our 
anaphora resolution algorithm. 
2.3 Training Features 
In our current machine learning experiments, we 
have taken an approach where we train a decision 
tree by feeding feature vectors for pairs of an anaphor 
and its possible antecedent. Currently we use 66 
features, and they include lezical (e.g. category), 
syntactic (e.g. grammatical role), semantic (e.g. se- 
mantic class), and positional (e.g. distance between 
anaphor and antecedent) features. Those features 
can be either unary features (i.e. features of either an 
anaphor or an antecedent such as syntactic number 
values) or binary features (i.e. features concerning 
relations between the pairs such as the positional re- 
lation between an anaphor and an antecedent.) We 
started with the features used by the MDR, gener- 
alized them, and added new features. The features 
that we employed are common across domains and 
languages though the feature values may change in 
different domains or languages. Example of training 
features are shown in Table 2. 
The feature values are obtained automatically by 
processing a set of texts with our NLP system, which 
performs lexical, syntactic and semantic analysis and 
then creates discourse markers (Kamp, 1981) for 
each NP and S. 3 Since discourse markers store the 
output of lexical, syntactic and semantic process- 
ing, the feature vectors are automatically calculated 
from them. Because the system output is not always 
perfect (especially given the complex newspaper ar- 
ticles), however, there is some noise in feature values. 
2.4 Training Methods 
We have employed different training methods using 
three parameters: anaphoric chains, anaphoric type 
identification, and confidence factors. 
The anaphoric chain parameter is used in selecting 
training examples. When this parameter is on, we 
select a set of positive training examples and a set 
of negative training examples for each anaphor in a 
text in the following way: 
3 Existence of zero pronouns in sentences is detected 
by the syntax module, and discourse maxkers are created 
for them. 
123 
<CORe: m='I"><COREF n~'4">ttl--lEff-</mR~:<u.~J- m='s'>y-'-- • ~')l,~Y:,,~,)t,¢.@~l~ (~P,-'ll~l~:.~t, :¢4t. lr)~) 
<CORE\]: m='O" rcPE='~ RB:='i"></COR~>III@b~. ~)q~'~6<COR~ ZD='2e rVPE='ZPm-t~-" REFf'I"></COREF>~Ii 
3"~. <CORe: ZD='~' WRf"NANE--OR6" RB:f'4">ttI--</COE~<COREF ~"8">q~,l,~ltC)~e't-"~.'3tt~ttll~:~'~'& 
</COR~<COR~ m='s" WR='tt~E-O~ REFf"#'>y-'---. ~')t,-~>-b,,v)l,</mR~{:~-, <COmF n)="¢' WPE='Dm" REF='8"> 
C r~ 5, ~-7" I, <,'CUT~ ~ <CORBF m='9" WR='ZT4~O-O~ 8EEf'5"> </OR~ ff -~ T <CO~ m=" ~o" TYR='~O-U~ RE~='5"> 
Figure 1: Text Tagged with Discourse Information using SGML 
Tags 
DNP 
DNP-F 
DNP-L 
DNP-ORG 
DNP-P 
DNP-T 
DNP-BOTH 
DNP-BOTH-ORG 
DNP-BOTH-L 
DNP-BOTH-P 
REFLEXIVE 
NAME 
NAME-F 
NAME-L 
NAME-ORG 
NAME-P 
DPRO 
LOCI 
TIMEI 
QZPRO 
QZPRO-ORG 
QZPRO-P 
ZPRO 
ZPRO-IMP 
ZPRO-ORG 
ZPRO-P 
Table 1: Summary of Anaphoric Types 
Meaning 
Definite NP 
Definite NP 
Definite NP 
Definite NP 
Definite NP 
Definite NP 
whose referent is a facility 
whose referent is a location 
whose referent is an organization 
whose referent is a person 
whose referent is time 
Definite NP whose referent is two entities 
Definite NP whose referent is two organization entities 
Definite NP whose referent is two location entities 
Definite NP whose referent is two person entities 
Reflexive expressions (e.$. "jisha ~) 
Proper name 
Proper name for facility 
Proper name for location 
Proper name for organization 
Proper name for person 
Deictic pronoun (this, these) 
Locational indexical (here, there) 
Time indexical (now, then, later) 
Quasi-zero pronoun 
Quasi-zero pronoun whose referent is an organization 
Quasi-zero pronoun whose referent is a person 
Zero pronoun 
Zero pronoun in an impersonal construction 
Zero pronoun whose referent is an organization 
Zero pronoun whose referent is a person 
JDEL Dou-ellipsis 
SONY-wa RCA-to teikeishi, VCR-wo QZPRO 
Sony-subj RCA-with joint venture VCR-obj (it) 
kaihatsusuru to QZPRO happyoushita 
develop that (it) announced 
"(SONY) announced that SONY will form a joint venture with RCA 
and (it) will develop VCR's." 
Figure 2: QZPRO Example 
Table 2: Examples of Training Features 
Unary feature Binaxy feature 
Lexical category matching-category 
Syntactic topicalized matching-topicalized 
Semantic semantic-class subsuming-semantic-class 
Positional antecedent-precedes-anaphor 
124 
Positive training examples are those anaphor- 
antecedent pairs whose anaphor is directly linked to 
its antecedent in the tagged corpus and also whose 
anaphor is paired with one of the antecedents on the 
anaphoric chain, i.e. the transitive closure between 
the anaphor and the first mention of the antecedent. 
For example, if B refers to A and C refers to B, C- 
A is a positive training example as well as B-A and 
C-B. 
Negative training examples are chosen by pairing 
an anaphor with all the possible antecedents in a text 
except for those on the transitive closure described 
above. Thus, if there are possible antecedents in the 
text which are not in the C-B-A transitive closure, 
say D, C-D and B-D are negative training examples. 
When the anaphoric chain parameter is off, only 
those anaphor-antecedent pairs whose anaphora are 
directly linked to their antecedents in the corpus are 
considered as positive examples. Because of the way 
in which the corpus was tagged (according to our 
tagging guidelines), an anaphor is linked to the most 
recent antecedent, except for a zero pronoun, which 
is linked to its most recent overt antecedent. In other 
words, a zero pronoun is never linked to another zero 
pronoun. 
The anaphoric type identification parameter is 
utilized in training decision trees. With this param- 
eter on, a decision tree is trained to answer "no" 
when a pair of an anaphor and a possible antecedent 
are not co-referential, or answer the anaphoric type 
when they are co-referential. If the parameter is off, 
a binary decision tree is trained to answer just "yes" 
or "no" and does not have to answer the types of 
anaphora. 
The confidence factor parameter (0-100) is used 
in pruning decision trees. With a higher confidence 
factor, less pruning of the tree is performed, and thus 
it tends to overfit the training examples. With a 
lower confidence factor, more pruning is performed, 
resulting in a smaller, more generalized tree. We 
used confidence factors of 25, 50, 75 and 100%. 
The anaphoric chain parameter described above 
was employed because an anaphor may have more 
than one "correct" antecedent, in which case there 
is no absolute answer as to whether one antecedent 
is better than the others. The decision tree approach 
we have taken may thus predict more than one an- 
tecedent to pair with a given anaphor. Currently, 
confidence values returned from the decision tree are 
employed when it is desired that a single antecedent 
be selected for a given anaphor. We are experiment- 
ing with techniques to break ties in confidence values 
from the tree. One approach is to use a particular 
bias, say, in preferring the antecedent closest to the 
anaphor among those with the highest confidence (as 
in the results reported here). Although use of the 
confidence values from the tree works well in prac- 
tice, these values were only intended as a heuristic 
for pruning in Quinlan's C4.5. We have plans to use 
cross-validation across the training set as a method 
of determining error-rates by which to prefer one 
predicted antecedent over another. 
Another approach is to use a hybrid method where 
a preference-trained decision tree is brought in to 
supplement the decision process. Preference-trained 
trees, like that discussed in Connolly et al. (Connolly 
et al., 1994), are trained by presenting the learn- 
ing algorithm with examples of when one anaphor- 
antecedent pair should be preferred over another. 
Despite the fact that such trees are learning prefer- 
ences, they may not produce sufficient preferences to 
permit selection of a single best anaphor-antecedent 
combination (see the "Related Work" section be- low). 
3 Testing 
In this section, we first discuss how we configured 
and developed the MLRs and the MDR for testing. 
Next, we describe the scoring methods used, and 
then the testing results of the MLRs and the MDR. 
In this paper, we report the results of the four types 
of anaphora, namely NAME-ORG, QZPRO-ORG, 
DNP-ORG, and ZPRO-ORG, since they are the ma- 
jority of the anaphora appearing in the texts and 
most important for the current domain (i.e. joint 
ventures) and application (i.e. information extrac- 
tion). 
3.1 Testing the MLRa 
To build MLRs, we first trained decision trees with 
1971 anaphora 4 (of which 929 were NAME-ORG; 
546 QZPRO-ORG; 87 DNP-ORG; 282 ZPRO-ORG) 
in 295 training texts. The six MLRs using decision 
trees with different parameter combinations are de- 
scribed in Table 3. 
Then, we trained decision trees in the MLR-2 
configuration with varied numbers of training texts, 
namely 50, 100, 150,200 and 250 texts. This is done 
to find out the minimum number of training texts to 
achieve the optimal performance. 
3.2 Testing the MDR 
The same training texts used by the MLRs served 
as development data for the MDR. Because the NLP 
system is used for extracting information about joint 
ventures, the MDR was configured to handle only 
the crucial subset of anaphoric types for this ex- 
periment, namely all the name anaphora and zero 
pronouns and the definite NPs referring to organi- 
zations (i.e. DNP-ORG). The MDR applies different 
sets of generators, filters and orderers to resolve dif- 
ferent anaphoric types (Aone and McKee, 1993). A 
generator generates a set of possible antecedent hy- 
potheses for each anaphor, while a filter eliminates 
*In both training and testing, we did not in- 
clude anaphora which refer to multiple discontinuous 
antecedents. 
125 
MLR-1 
MLR-2 
MLR-3 
MLR-4 
MLR-5 
MLR-6 
Table 3: Six Configurations of MLRs 
yes no 
yes no 
yes no 
yes no 
yes yes 
no no 
confidence factor 
lOO% 
75% ' 
50% " 
25% 
75% 
75% 
unlikely hypotheses from the set. An orderer ranks 
hypotheses in a preference order if there is more than 
one hypothesis left in the set after applying all the 
applicable filters. Table 4 shows KS's employed for 
the four anaphoric types. 
3.3 Scoring Method 
We used recall and precision metrics, as shown in 
Table 5, to evaluate the performance of anaphora 
resolution. It is important to use both measures 
because one can build a high recall-low precision 
system or a low recall-high precision system, neither 
of which may be appropriate in certain situations. 
The NLP system sometimes fails to create discourse 
markers exactly corresponding to anaphora in texts 
due to failures of hxical or syntactic processing. In 
order to evaluate the performance of the anaphora 
resolution systems themselves, we only considered 
anaphora whose discourse markers were identified by 
the NLP system in our evaluation. Thus, the system 
performance evaluated against all the anaphora in 
texts could be different. 
Table 5: Recall and Precision Metrics for Evaluation 
Recall = Nc/I, Precision = Nc/Nn 
I Number of system-identified anaphora in input 
N~ Number of correct resolutions 
Nh Number of resolutions attempted 
3.4 Testing Results 
The testing was done using 1359 anaphora (of which 
1271 were one of the four anaphoric types) in 200 
blind test texts for both the MLRs and the MDR. It 
should be noted that both the training and testing 
texts are newspaper articles about joint ventures, 
and that each article always talks about more than 
one organization. Thus, finding antecedents of orga- 
nizational anaphora is not straightforward. Table 6 
shows the results of six different MLRs and the MDR 
for the four types of anaphora, while Table 7 shows 
the results of the MLR-2 with different sizes of train- 
ing examples, 
4 Evaluation 
4.1 The MLRs vs. the MDR 
Using F-measures 5 as an indicator for overall perfor- 
mance, the MLRs with the chain parameters turned 
on and type identification turned off (i.e. MLR-1, 2, 
3, and 4) performed the best. MLR-1, 2, 3, 4, and 5 
all exceeded the MDR in overall performance based 
on F-measure. 
Both the MLRs and the MDR used the char- 
acter subsequence, the proper noun category, and 
the semantic class feature values for NAME-ORG 
anaphora (in MLR-5, using anaphoric type identifi- 
cation). It is interesting to see that the MLR addi- 
tionally uses the topicalization feature before testing 
the semantic class feature. This indicates that, infor- 
mation theoretically, if the topicalization feature is 
present, the semantic class feature is not needed for 
the classification. The performance of NAME-ORG 
is better than other anaphoric phenomena because 
the character subsequence feature has very high an- 
tecedent predictive power. 
4.1.1 Evaluation of the MLIts 
Changing the three parameters in the MLRs 
caused changes in anaphora resolution performance. 
As Table 6 shows, using anaphoric chains without 
anaphoric type identification helped improve the 
MLRs. Our experiments with the confidence fac- 
tor parameter indicates the trade off between recall 
and precision. With 100% confidence factor, which 
means no pruning of the tree, the tree overfits the 
examples, and leads to spurious uses of features such 
as the number of sentences between an anaphor and 
an antecedent near the leaves of the generated tree. 
This causes the system to attempt more anaphor 
resolutions albeit with lower precision. Conversely, 
too much pruning can also yield poorer results. 
MLR-5 illustrates that when anaphoric type iden- 
tification is turned on the MLR's performance drops 
SF-measure is calculated by: 
F= (~2+1.0) × P x R 
#2 x P+R 
where P is precision, R is recall, and /3 is the relative 
importance given to recall over precision. In this case, 
= 1.0. 
126 
NAME-ORG 
DNP-ORG 
QZPRO-ORG 
ZPRO-ORG 
Generators 
Table 4: KS's used by the MDR 
Filters 
current-text 
current-text 
current-paragraph 
current-paragraph 
syntactic-category-propn 
nam~chax-subsequence 
semantic-class-org 
semantic-dass-org 
semantic-amount-singular 
not-in-the-same-dc 
semantic-dass-from-pred 
not-in-the-same-dc 
sere antic-dass-from-pred 
Orderers 
reverse-recency 
topica\]ization 
subject-np 
recency 
topica\]ization 
subject-np 
category-np 
recency 
topicalization 
subject-np 
category-np 
recency 
# exmpls 
MLR-1 
MLR-2 
MLR-3 
MLR-4 
MLR-5 
MLR-6 
MDR 
Table 6: Recall and Precision of the MLRs and the MDR 
NAME-ORG 
631 
R P 
84.79 92.24 
84.79 93.04 
83.20 94.09 
83.84 94.30 
85.74 92.80 
68.30 91.70 
76.39 90.09 
DNP-ORG 
54 
R P 
44.44 50.00 
44.44 52.17 
37.04 58.82 
38.89 60.00 
44.44 55.81 
29.63 64.00 
35.19 50.00 
383 
R P 
65.62 80.25 
64.84 84.69 
63.02 84.91 
64.06 85.12 
56.51 89.67 
54.17 90.83 
67.19 67.19 
ZPRO-ORG 
203 
R P 
4O.78 64.62 
39.32 73.64 
35.92 73.27 
37.86 76.47 
15.53 78.05 
13.11 75.00 
43.20 43.20 
Average 
1271 
R P 
70.20 83.49 
69.73 86.73 
67.53 88.04 
68.55 88.55 
63.84 89.55 
53.49 89.74 
66.51 72.91 
F-measure 
1271 
F 
76.27 
77.30 
76.43 
77.28 
74.54 
67.03 
69.57 
texts 
50 
I00 
150 
2OO 
25O 
295 
MDR 
Table 7:MLR-2 Configuration with Varied Training Data Sizes 
NAME-ORG DNP-ORG QZPRO-ORG ZPRO-ORG 
R P R P R 
81.30 91.94 35.19 48.72 59.38 
82.09 92.01 38.89 53.85 63.02 
82.57 91.89 48.15 60.47 55.73 
83.99 91.70 46.30 60.98 63.02 
84.79 93.21 44.44 53.33 65.10 
84.79 93.04 44.44 52.17 64.84 
76.39 90.09 35.19 50.00 67.19 
Average F-measure 
P R P R P F 
76.77 29.13 56.07 64.31 81.92 72.06 
85.82 28.64 62.77 65.88 85.89 74.57 
85.60 20.39 70.00 62.98 87.28 73.17 
82.88 36.41 65.22 68.39 84.99 75.79 
83.89 40.78 73.04 70.04 86.53 77.42 
84.69 39.32 73.64 69.73 86.73 77.30 
67.19 43.20 43.20 66.51 72.91 69.57 
127 
but still exceeds that of the MDR. MLR-6 shows the 
effect of not training on anaphoric chains. It results 
in poorer performance than the MLR-1, 2, 3, 4, and 
5 configurations and the MDR. 
One of the advantages of the MLRs is that due 
to the number of different anaphoric types present 
in the training data, they also learned classifiers 
for several additional anaphoric types beyond what 
the MDR could handle. While additional coding 
would have been required for each of these types 
in the MDR, the MLRs picked them up without ad- 
ditional work. The additional anaphoric types in- 
cluded DPRO, REFLEXIVE, and TIMEI (cf. Ta- 
ble 1). Another advantage is that, unlike the MDR, 
whose features are hand picked, the MLRs automat- 
ically select and use necessary features. 
We suspect that the poorer performance of ZPRO- 
OR(; and DNP-ORG may be due to the following 
deficiency of the current MLR algorithms: Because 
anaphora resolution is performed in a "batch mode" 
for the MLRs, there is currently no way to perco- 
late the information on an anaphor-antecedent link 
found by a system after each resolution. For exam- 
ple, if a zero pronoun (Z-2) refers to another zero 
pronoun (Z-l), which in turn refers to an overt NP, 
knowing which is the antecedent of Z-1 may be im- 
portant for Z-2 to resolve its antecedent correctly. 
However, such information is not available to the 
MLRs when resolving Z-2. 
4.1.2 Evaluation of the MDR 
One advantage of the MDR is that a tagged train- 
ing corpus is not required for hand-coding the reso- 
lution algorithms. Of course, such a tagged corpus 
is necessary to evaluate system performance quan- 
titatively and is also useful to consult with during 
algorithm construction. 
However, the MLR results seem to indicate the 
limitation of the MDR in the way it uses orderer 
KS's. Currently, the MDR uses an ordered list of 
multiple orderer KS's for each anaphoric type (cf. 
Table 4), where the first applicable orderer KS in the 
list is used to pick the best antecedent when there is 
more than one possibility. Such selection ignores the 
fact that even anaphora of the same type may use 
different orderers (i.e. have different preferences), de- 
pending on the types of possible antecedents and on 
the context in which the particular anaphor was used 
in the text. 
4.2 Training Data Size vs. Performance 
Table 7 indicates that with even 50 training texts, 
the MLR achieves better performance than the 
MDR. Performance seems to reach a plateau at 
about 250 training examples with a F-measure of 
around 77.4. 
5 Related Work 
Anaphora resolution systems for English texts based 
on various machine learning algorithms, including a 
decision tree algorithm, are reported in Connolly et 
al. (Connolly et al., 1994). Our approach is different 
from theirs in that their decision tree identifies which 
of the two possible antecedents for a given anaphor 
is "better". The assumption seems to be that the 
closest antecedent is the "correct" antecedent. How- 
ever, they note a problem with their decision tree in 
that it is not guaranteed to return consistent clas- 
sifications given that the "preference" relationship 
between two possible antecedents is not transitive. 
Soderland and Lehnert's machine learning-based 
information extraction system (Soderland and Lehn- 
ert, 1994) is used specifically for filling particular 
templates from text input. Although a part of its 
task is to merge multiple referents when they corefer 
(i.e. anaphora resolution), it is hard to evaluate how 
their anaphora resolution capability compares with 
ours, since it is not a separate module. The only 
evaluation result provided is their extraction result. 
Our anaphora resolution system is modular, and can 
be used for other NLP-based applications such as 
machine translation. Soderland and Lehnert's ap- 
proach relies on a large set of filled templates used for 
training. Domain-specific features from those tem- 
plates are employed for the learning. Consequently, 
the learned classifiers are very domain-specific, and 
thus the approach relies on the availability of new 
filled template sets for porting to other domains. 
While some such template sets exist, such as those 
assembled for the Message Understanding Confer- 
ences, collecting such large amounts of training data 
for each new domain may be impractical. 
Zero pronoun resolution for machine translation 
reported by Nakaiwa and Ikehara (Nakaiwa and Ike- 
hara, 1992) used only semantic attributes of verbs 
in a restricted domain. The small test results (102 
sentences from 29 articles) had high success rate 
of 93%. However, the input was only the first 
paragraphs of newspaper articles which contained 
relatively short sentences. Our anaphora resolu- 
tion systems reported here have the advantages of 
domain-independence and full-text handling without 
the need for creating an extensive domain knowledge 
base. 
Various theories of Japanese zero pronouns have 
been proposed by computational linguists, for ex- 
ample, Kameyama (Kameyama, 1988) and Walker 
et aL (Walker et al., 1994). Although these the- 
ories are based on dialogue examples rather than 
texts, "features" used by these theories and those 
by the decision trees overlap interestingly. For ex- 
ample, Walker et ai. proposes the following ranking 
scheme to select antecedents of zero pronouns. 
(GRAMMATICAL or ZERO) TOPIC > EMPATHY > 
SUBJECT > OBJECT2 > OBJECT > OTHERS 
128 
In examining decision trees produced with anaphoric 
type identification turned on, the following features 
were used for QZPRO-ORG in this order: topical- 
ization, distance between an anaphor and an an- 
tecedent, semantic class of an anaphor and an an- 
tecedent, and subject NP. We plan to analyze further 
the features which the decision tree has used for zero 
pronouns and compare them with these theories. 
6 Summary and Future Work 
This paper compared our automated and manual ac- 
quisition of anaphora resolution strategies, and re- 
ported optimistic results for the former. We plan 
to continue to improve machine learning-based sys- 
tem performance by introducing other relevant fea- 
tures. For example, discourse structure informa- 
tion (Passonneau and Litman, 1993; Hearst, 1994), 
if obtained reliably and automatically, will be an- 
other useful domain-independent feature. In addi- 
tion, we will explore the possibility of combining 
machine learning results with manual encoding of 
discourse knowledge. This can be accomplished by 
allowing the user to interact with the produced clas- 
sifters, tracing decisions back to particular examples 
and allowing users to edit features and to evaluate 
the efficacy of changes. 

References 
Chinatsu Aone and Scott W. Bennett. 1994. Dis- 
course Tagging Tool and Discourse-tagged Mul- 
tilingual Corpora. In Proceedings of Interna- 
tional Workshop on Sharable Natural Language 
Resources (SNLR). 
Chinatsu Aone and Douglas McKee. 1993. 
Language-Independent Anaphora Resolution Sys- 
tem for Understanding Multilingual Texts. In 
Proceedings of 31st Annual Meeting of the ACL. 
Chinatsu Aone, Sharon Flank, Paul Krause, and 
Doug McKee. 1993. SRA: Description of the 
SOLOMON System as Used for MUC-5. In Pro- 
ceedings of Fourth Message Understanding Con- 
ference (MUC-5). 
Chinatsu Aone. 1994. Customizing and Evaluating 
a Multilingual Discourse Module. In Proceedings 
of the 15th International Conference on Compu- 
tational Linguistics (COLING). 
Susan Brennan, Marilyn Friedman, and Carl Pol- 
lard. 1987. A Centering Approach to Pronouns. 
In Proceedings of 25th Annual Meeting of the 
ACL. 
Dennis Connolly, John D. Burger, and David S. 
Day. 1994. A Machine Learning Approach to 
Anaphoric Reference. In Proceedings of Interna- 
tional Conference on New Methods in Language 
Processing (NEMLAP). 
Marti A. Hearst. 1994. Multi-Paragraph Segmenta- 
tion of Expository Text. In Proceedings of 32nd 
Annual Meeting of the ACL. 
Jerry R. Hobbs. 1976. Pronoun Resolution. Tech- 
nical Report 76-1, Department of Computer Sci- 
ence, City College, City University of New York. 
Megumi Kameyama. 1988. Japanese Zero Pronom- 
inal Binding, where Syntax and Discourse Meet. 
In Papers from the Second International Worksho 
on Japanese Syntax. 
Hans Kamp. 1981. A Theory of Truth and Semantic 
Representation. In J. Groenendijk et al., editors, 
Formal Methods in the Study of Language. Math- 
ematical Centre, Amsterdam. 
Hiromi Nakaiwa and Satoru Ikehara. 1992. Zero 
Pronoun Resolution in a Japanese to English Ma- 
chine Translation Systemby using Verbal Seman- 
tic Attribute. In Proceedings of the Fourth Con- 
ference on Applied Natural Language Processing. 
Rebecca J. Passonneau and Diane J. Litman. 1993. 
Intention-Based Segmentation: Human Reliabil- 
ity and Correlation with Linguistic Cues. In Pro- 
ceedings of 31st Annual Meeting of the ACL. 
J. Ross quinlan. 1993. C~.5: Programs forMachine 
Learning. Morgan Kaufmann Publishers. 
Stephen Soderland and Wendy Lehnert. 1994. 
Corpus-driven Knowledge Acquisition for Dis- 
course Analysis. In Proceedings of AAAI. 
Marilyn Walker, Masayo Iida, and Sharon Cote. 
1994. Japanese Discourse and the Process of Cen- 
tering. Computational Linguistics, 20(2). 
Marilyn A. Walker. 1989. Evaluating Discourse Pro- 
cessing Algorithms. In Proceedings of 27th Annual 
Meeting of the ACL. 
