Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 88–95, Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
An Expectation Maximization Approach to Pronoun Resolution
Colin Cherry and Shane Bergsma
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada, T6G 2E8
{colinc,bergsma}@cs.ualberta.ca
Abstract
We propose an unsupervised Expectation
Maximization approach to pronoun reso-
lution. The system learns from a fixed
list of potential antecedents for each pro-
noun. We show that unsupervised learn-
ing is possible in this context, as the per-
formance of our system is comparable to
supervised methods. Our results indicate
that a probabilistic gender/number model,
determined automatically from unlabeled
text, is a powerful feature for this task.
1 Introduction
Coreference resolution is the process of determin-
ing which expressions in text refer to the same real-
world entity. Pronoun resolution is the important yet
challenging subset of coreference resolution where a
system attempts to establish coreference between a
pronominal anaphor, such as a third-person pronoun
like he, she, it, or they, and a preceding noun phrase,
called an antecedent. In the following example, a
pronoun resolution system must determine the cor-
rect antecedent for the pronouns “his” and “he.”
(1) When the president entered the arena with his
family, he was serenaded by a mariachi band.
Pronoun resolution has applications across many
areas of Natural Language Processing, particularly
in the field of information extraction. Resolving a
pronoun to a noun phrase can provide a new inter-
pretation of a given sentence, giving a Question An-
swering system, for example, more data to consider.
Our approach is a synthesis of linguistic and sta-
tistical methods. For each pronoun, a list of an-
tecedent candidates derived from the parsed corpus
is presented to the Expectation Maximization (EM)
learner. Special cases, such as pleonastic, reflex-
ive and cataphoric pronouns are dealt with linguisti-
cally during list construction. This allows us to train
on and resolve all third-person pronouns in a large
Question Answering corpus. We learn lexicalized
gender/number, language, and antecedent probabil-
ity models. These models, tied to individual words,
can not be learned with sufficient coverage from la-
beled data. Pronouns are resolved by choosing the
most likely antecedent in the candidate list accord-
ing to these distributions. The resulting resolution
accuracy is comparable to supervised methods.
We gain further performance improvement by ini-
tializing EM with a gender/number model derived
from special cases in the training data. This model
is shown to perform reliably on its own. We also
demonstrate how the models learned through our un-
supervised method can be used as features in a su-
pervised pronoun resolution system.
2 Related Work
Pronoun resolution typically employs some com-
bination of constraints and preferences to select
the antecedent from preceding noun phrase candi-
dates. Constraints filter the candidate list of improb-
able antecedents, while preferences encourage se-
lection of antecedents that are more recent, frequent,
etc. Implementation of constraints and preferences
can be based on empirical insight (Lappin and Le-
ass, 1994), or machine learning from a reference-
88
annotated corpus (Ge et al., 1998). The majority
of pronoun resolution approaches have thus far re-
lied on manual intervention in the resolution pro-
cess, such as using a manually-parsed corpus, or
manually removing difficult non-anaphoric cases;
we follow Mitkov et al.’s approach (2002) with a
fully-automatic pronoun resolution method. Pars-
ing, noun-phrase identification, and non-anaphoric
pronoun removal are all done automatically.
Machine-learned, fully-automatic systems are
more common in noun phrase coreference resolu-
tion, where the method of choice has been deci-
sion trees (Soon et al., 2001; Ng and Cardie, 2002).
These systems generally handle pronouns as a subset
of all noun phrases, but with limited features com-
pared to systems devoted solely to pronouns. Kehler
used Maximum Entropy to assign a probability dis-
tribution over possible noun phrase coreference re-
lationships (1997). Like his approach, our system
does not make hard coreference decisions, but re-
turns a distribution over candidates.
The above learning approaches require anno-
tated training data for supervised learning. Cardie
and Wagstaff developed an unsupervised approach
that partitions noun phrases into coreferent groups
through clustering (1999). However, the partitions
they generate for a particular document are not use-
ful for processing new documents, while our ap-
proach learns distributions that can be used on un-
seen data. There are also approaches to anaphora
resolution using unsupervised methods to extract
useful information, such as gender and number (Ge
et al., 1998), or contextual role-knowledge (Bean
and Riloff, 2004). Co-training can also leverage
unlabeled data through weakly-supervised reference
resolution learning (M¨uller et al., 2002). As an alter-
native to co-training, Ng and Cardie (2003) use EM
to augment a supervised coreference system with
unlabeled data. Their feature set is quite different, as
it is designed to generalize from the data in a labeled
set, while our system models individual words. We
suspect that the two approaches can be combined.
Our approach is inspired by the use of EM in
bilingual word alignment, which finds word-to-word
correspondences between a sentence and its transla-
tion. The prominent statistical methods in this field
are unsupervised. Our methods are most influenced
by IBM’s Model 1 (Brown et al., 1993).
3 Methods
3.1 Problem formulation
We will consider our training set to consist of
(p,k,C) triples: one for each pronoun, where p is
the pronoun to be resolved, k is the pronoun’s con-
text, and C is a candidate list containing the nouns p
could potentially be resolved to. Initially, we take k
to be the parsed sentence that p appears in.
C consists of all nouns and pronouns that precede
p, looking back through the current sentence and the
sentence immediately preceding it. This small win-
dow may seem limiting, but we found that a cor-
rect candidate appeared in 97% of such lists in a
labeled development text. Mitkov et al. also limit
candidate consideration to the same window (2002).
Each triple is processed with non-anaphoric pronoun
handlers (Section 3.3) and linguistic filters (Sec-
tion 3.4), which produce the final candidate lists.
Before we pass the (p,k,C) triples to EM, we
modify them to better suit our EM formulation.
There are four possibilities for the gender and num-
ber of third-person pronouns in English: masculine,
feminine, neutral and plural (e.g., he, she, it, they).
We assume a noun is equally likely to corefer with
any member of a given gender/number category, and
reduce each p to a category label accordingly. For
example, he, his, him and himself are all labeled as
masc for masculine pronoun. Plural, feminine and
neutral pronouns are handled similarly. We reduce
the context term k to p’s immediate syntactic con-
text, including only p’s syntactic parent, the parent’s
part of speech, and p’s relationship to the parent, as
determined by a dependency parser. Incorporating
context only through the governing constituent was
also done in (Ge et al., 1998). Finally, each candi-
date in C is augmented with ordering information,
so we know how many nouns to “step over” before
arriving at a given candidate. We will refer to this or-
dering information as a candidate’s j term, for jump.
Our example sentence in Section 1 would create the
two triples shown in Figure 1, assuming the sentence
began the document it was found in.
3.2 Probability model
Expectation Maximization (Dempster et al., 1977) is
a process for filling in unobserved data probabilisti-
cally. To use EM to do unsupervised pronoun reso-
89
his: p = masc k = p’s familyC = arena (0), president (1)
he: p = masc k = serenade pC = family (0), masc (1), arena (2),
president (3)
Figure 1: EM input for our example sentence.
j-values follow each lexical candidate.
lution, we phrase the resolution task in terms of hid-
den variables of an observed process. We assume
that in each case, one candidate from the candidate
list is selected as the antecedent before p and k are
generated. EM’s role is to induce a probability dis-
tribution over candidates to maximize the likelihood
of the (p,k) pairs observed in our training set:
Pr(Dataset) = productdisplay
(p,k)∈Dataset
Pr(p,k) (1)
We can rewrite Pr(p,k) so that it uses a hidden can-
didate (or antecedent) variable c that influences the
observed p and k:
Pr(p,k) = summationdisplay
c∈C
Pr(p,k,c) (2)
Pr(p,k,c) = Pr(p,k|c)Pr(c) (3)
To improve our ability to generalize to future cases,
we use a na¨ıve Bayes assumption to state that the
choices of pronoun and context are conditionally in-
dependent, given an antecedent. That is, once we
select the word the pronoun represents, the pronoun
and its context are no longer coupled:
Pr(p,k|c) = Pr(p|c)Pr(k|c) (4)
We can split each candidate c into its lexical com-
ponent l and its jump value j. That is, c = (l,j).
If we assume that l and j are independent, and that
p and k each depend only on the l component of c,
we can combine Equations 3 and 4 to get our final
formulation for the joint probability distribution:
Pr(p,k,c) = Pr(p|l)Pr(k|l)Pr(l)Pr(j) (5)
The jump term j, though important when resolving
pronouns, is not likely to be correlated with any lex-
ical choices in the training set.
Table 1: Examples of learned pronoun probabilities.
Word (l) masc fem neut plur
company 0.03 0.01 0.95 0.01
president 0.94 0.01 0.03 0.02
teacher 0.19 0.71 0.09 0.01
This results in four models that work together to
determine the likelihood of a given candidate. The
Pr(p|l) distribution measures the likelihood of a pro-
noun given an antecedent. Since we have collapsed
the observed pronouns into groups, this models a
word’s affinity for each of the four relevant gen-
der/number categories. We will refer to this as our
pronoun model. Pr(k|l) measures the probability of
the syntactic relationship between a pronoun and its
parent, given a prospective antecedent for the pro-
noun. This is effectively a language model, grading
lexical choice by context. Pr(l) measures the prob-
ability that the word l will be found to be an an-
tecedent. This is useful, as some entities, such as
“president” in newspaper text, are inherently more
likely to be referenced with a pronoun. Finally,
Pr(j) measures the likelihood of jumping a given
number of noun phrases backward to find the cor-
rect candidate. We represent these models with ta-
ble look-up. Table 1 shows selected l-value entries
in the Pr(p|l) table from our best performing EM
model. Note that the probabilities reflect biases in-
herent in our news domain training set.
Given models for the four distributions above,
we can assign a probability to each candidate in
C according to the observations p and k; that is,
Pr(c|p,k) can be obtained by dividing Equation 5
by Equation 2. Remember that c = (l,j).
Pr(c|p,k) = Pr(p|l)Pr(k|l)Pr(l)Pr(j)summationtext
cprime∈C Pr(p|lprime)Pr(k|lprime)Pr(lprime)Pr(jprime)(6)
Pr(c|p,k) allows us to get fractional counts of
(p,k,c) triples in our training set, as if we had actu-
ally observed c co-occurring with (p,k) in the pro-
portions specified by Equation 6. This estimation
process is effectively the E-step in EM.
The M-step is conducted by redefining our mod-
els according to these fractional counts. For exam-
ple, after assigning fractional counts to candidates
90
according to Pr(c|p,k), we re-estimate Pr(p|l) with
the following equation for a specific (p,l) pair:
Pr(p|l) = N(p,l)N(l) (7)
where N() counts the number of times we see a
given event or joint event throughout the training set.
Given trained models, we resolve pronouns by
finding the candidate ˆc that is most likely for the
current pronoun, that is ˆc = argmaxc∈CPr(c|p,k).
Because Pr(p,k) is constant with respect to c,
ˆc = argmaxc∈CPr(p,k,c).
3.3 Non-anaphoric Pronouns
Not every pronoun in text refers anaphorically to a
preceding noun phrase. There are a frequent num-
ber of difficult cases that require special attention,
including pronouns that are:
• Pleonastic: pronouns that have a grammatical
function but do not reference an entity. E.g. “It
is important to observe it is raining.”
• Cataphora: pronouns that reference a future
noun phrase. E.g. “In his speech, the president
praised the workers.”
• Non-noun referential: pronouns that refer to a
verb phrase, sentence, or implicit concept. E.g.
“John told Mary they should buy a car.”
If we construct them na¨ıvely, the candidate lists
for these pronouns will be invalid, introducing noise
in our training set. Manual handling or removal
of these cases is infeasible in an unsupervised ap-
proach, where the input is thousands of documents.
Instead, pleonastics are identified syntactically us-
ing an extension of the detector developed by Lap-
pin and Leass (1994). Roughly 7% of all pronouns
in our labeled test data are pleonastic. We detect
cataphora using a pattern-based method on parsed
sentences, described in (Bergsma, 2005b). Future
nouns are only included when cataphora are iden-
tified. This approach is quite different from Lap-
pin and Leass (1994), who always include all fu-
ture nouns from the current sentence as candidates,
with a constant penalty added to possible cataphoric
resolutions. The cataphora module identifies 1.4%
of test data pronouns to be cataphoric; in each in-
stance this identification is correct. Finally, we know
of no approach that handles pronouns referring to
verb phrases or implicit entities. The unavoidable
errors for these pronouns, occurring roughly 4% of
the time, are included in our final results.
3.4 Candidate list modifications
It would be possible for C to include every noun
phrase in the current and previous sentence, but per-
formance can be improved by automatically remov-
ing improbable antecedents. We use a standard set of
constraints to filter candidates. If a candidate’s gen-
der or number is known, and does not match the pro-
noun’s, the candidate is excluded. Candidates with
known gender include other pronouns, and names
with gendered designators (such as “Mr.” or “Mrs.”).
Our parser also identifies plurals and some gendered
first names. We remove from C all times, dates, ad-
dresses, monetary amounts, units of measurement,
and pronouns identified as pleonastic.
We use the syntactic constraints from Binding
Theory to eliminate candidates (Haegeman, 1994).
For the reflexives himself, herself, itself and them-
selves, this allows immediate syntactic identification
of the antecedent. These cases become unambigu-
ous; only the indicated antecedent is included in C.
We improve the quality of our training set by re-
moving known noisy cases before passing the set
to EM. For example, we anticipate that sentences
with quotation marks will be problematic, as other
researchers have observed that quoted text requires
special handling for pronoun resolution (Kennedy
and Boguraev, 1996). Thus we remove pronouns
occurring in the same sentences as quotes from the
learning process. Also, we exclude triples where
the constraints removed all possible antecedents, or
where the pronoun was deemed to be pleonastic.
Performing these exclusions is justified for training,
but in testing we state results for all pronouns.
3.5 EM initialization
Early in the development of this system, we were
impressed with the quality of the pronoun model
Pr(p|l) learned by EM. However, we found we could
construct an even more precise pronoun model for
common words by examining unambiguous cases in
our training data. Unambiguous cases are pronouns
having only one word in their candidate list C. This
could be a result of the preprocessors described in
91
Sections 3.3 and 3.4, or the pronoun’s position in
the document. A PrU(p|l) model constructed from
only unambiguous examples covers far fewer words
than a learned model, but it rarely makes poor gen-
der/number choices. Furthermore, it can be obtained
without EM. Training on unambiguous cases is sim-
ilar in spirit to (Hindle and Rooth, 1993). We found
in our development and test sets that, after applying
filters, roughly 9% of pronouns occur with unam-
biguous antecedents.
When optimizing a probability function that is not
concave, the EM algorithm is only guaranteed to
find a local maximum; therefore, it can be helpful
to start the process near the desired end-point in pa-
rameter space. The unambiguous pronoun model
described above can provide such a starting point.
When using this initializer, we perform our ini-
tial E-step by weighting candidates according to
PrU(p|l), instead of weighting them uniformly. This
biases the initial E-step probabilities so that a strong
indication of the gender/number of a candidate from
unambiguous cases will either boost the candidate’s
chances or remove it from competition, depending
on whether or not the predicted category matches
that of the pronoun being resolved.
To deal with the sparseness of the PrU(p|l) dis-
tribution, we use add-1 smoothing (Jeffreys, 1961).
The resulting effect is that words with few unam-
biguous occurrences receive a near-uniform gen-
der/number distribution, while those observed fre-
quently will closely match the observed distribution.
During development, we also tried clever initializers
for the other three models, including an extensive
language model initializer, but none were able to im-
prove over PrU(p|l) alone.
3.6 Supervised extension
Even though we have justified Equation 5 with rea-
sonable independence assumptions, our four mod-
els may not be combined optimally for our pronoun
resolution task, as the models are only approxima-
tions of the true distributions they are intended to
represent. Following the approach in (Och and Ney,
2002), we can view the right-hand-side of Equa-
tion 5 as a special case of:
exp
parenleftBigg
λ1 log Pr(p|l) + λ2 log Pr(k|l)+
λ3 log Pr(l) + λ4 log Pr(j)
parenrightBigg
(8)
where ∀i : λi = 1. Effectively, the log proba-
bilities of our models become feature functions in
a log-linear model. When labeled training data is
available, we can use the Maximum Entropy princi-
ple (Berger et al., 1996) to optimize the λ weights.
This provides us with an optional supervised ex-
tension to the unsupervised system. Given a small
set of data that has the correct candidates indicated,
such as the set we used while developing our unsu-
pervised system, we can re-weight the final models
provided by EM to maximize the probability of ob-
serving the indicated candidates. To this end, we
follow the approach of (Och and Ney, 2002) very
closely, including their handling of multiple correct
answers. We use the limited memory variable met-
ric method as implemented in Malouf’s maximum
entropy package (2002) to set our weights.
4 Experimental Design
4.1 Data sets
We used two training sets in our experiments, both
drawn from the AQUAINT Question Answering
corpus (Vorhees, 2002). For each training set, we
manually labeled pronoun antecedents in a corre-
sponding key containing a subset of the pronouns
in the set. These keys are drawn from a collection
of complete documents. For each document, all pro-
nouns are included. With the exception of the super-
vised extension, the keys are used only to validate
the resolution decisions made by a trained system.
Further details are available in (Bergsma, 2005b).
The development set consists of 333,000 pro-
nouns drawn from 31,000 documents. The devel-
opment key consists of 644 labeled pronouns drawn
from 58 documents; 417 are drawn from sentences
without quotation marks. The development set and
its key were used to guide us while designing the
probability model, and to fine-tune EM and smooth-
ing parameters. We also use the development key as
labeled training data for our supervised extension.
The test set consists of 890,000 pronouns drawn
from 50,000 documents. The test key consists of
1209 labeled pronouns drawn from 118 documents;
892 are drawn from sentences without quotation
marks. All of the results reported in Section 5 are
determined using the test key.
92
4.2 Implementation Details
To get the context values and implement the syntac-
tic filters, we parsed our corpora with Minipar (Lin,
1994). Experiments on the development set indi-
cated that EM generally began to overfit after 2 it-
erations, so we stop EM after the second iteration,
using the models from the second M-step for test-
ing. During testing, ties in likelihood are broken by
taking the candidate closest to the pronoun.
The EM-produced models need to be smoothed,
as there will be unseen words and unobserved (p,l)
or (k,l) pairs in the test set. This is because prob-
lematic cases are omitted from the training set, while
all pronouns are included in the key. We han-
dle out-of-vocabulary events by replacing words or
context-values that occur only once during training
with a special unknown symbol. Out-of-vocabulary
events encountered during testing are also treated
as unknown. We handle unseen pairs with additive
smoothing. Instead of adding 1 as in Section 3.5, we
add δp = 0.00001 for (k,l) pairs, and δw = 0.001
for (p,l) pairs. These δ values were determined ex-
perimentally with the development key.
4.3 Evaluation scheme
We evaluate our work in the context of a fully auto-
matic system, as was done in (Mitkov et al., 2002).
Our evaluation criteria is similar to their resolution
etiquette. We define accuracy as the proportion of
pronouns correctly resolved, either to any coreferent
noun phrase in the candidate list, or to the pleonas-
tic category, which precludes resolution. Systems
that handle and state performance for all pronouns
in unrestricted text report much lower accuracy than
most approaches in the literature. Furthermore, au-
tomatically parsing and pre-processing texts causes
consistent degradation in performance, regardless of
the accuracy of the pronoun resolution algorithm. To
have a point of comparison to other fully-automatic
approaches, note the resolution etiquette score re-
ported in (Mitkov et al., 2002) is 0.582.
5 Results
5.1 Validation of unsupervised method
The key concern of our work is whether enough
useful information is present in the pronoun’s cat-
egory, context, and candidate list for unsupervised
learning of antecedents to occur. To that end, our
first set of experiments compare the pronoun resolu-
tion accuracy of our EM-based solutions to that of a
previous-noun baseline on our test key. The results
are shown in Table 2. The columns split the results
into three cases: all pronouns with no exceptions;
all cases where the pronoun was found in a sentence
containing no quotation marks (and therefore resem-
bling the training data provided to EM); and finally
all pronouns excluded by the second case. We com-
pare the following methods:
1. Previous noun: Pick the candidate from the fil-
tered list with the lowest j value.
2. EM, no initializer: The EM algorithm trained
on the test set, starting from a uniform E-step.
3. Initializer, no EM: A model that ranks candi-
dates using only a pronoun model built from
unambiguous cases (Section 3.5).
4. EM w/ initializer: As in (2), but using the ini-
tializer in (3) for the first E-step.
5. Maxent extension: The models produced by
(4) are used as features in a log-linear model
trained on the development key (Section 3.6).
6. Upper bound: The percentage of cases with a
correct answer in the filtered candidate list.
For a reference point, picking the previous noun be-
fore applying any of our candidate filters receives an
accuracy score of 0.281 on the “All” task.
Looking at the “All” column in Table 2, we see
EM can indeed learn in this situation. Starting from
uniform parameters it climbs from a 40% baseline
to a 60% accurate model. However, the initializer
can do slightly better with precise but sparse gen-
der/number information alone. As we hoped, com-
bining the initializer and EM results in a statistically
significant1 improvement over EM with a uniform
starting point, but it is not significantly better than
the initializer alone. The advantage of the EM pro-
cess is that it produces multiple models, which can
be re-weighted with maximum entropy to reach our
highest accuracy, roughly 67%. The λ weights that
achieve this score are shown in Table 3.
Maximum entropy leaves the pronoun model
Pr(p|l) nearly untouched and drastically reduces the
1Significance is determined throughout Section 5 using Mc-
Nemar’s test with a significance level α = 0.05.
93
Table 2: Accuracy for all cases, all excluding sen-
tences with quotes, and only sentences with quotes.
Method All No“ ” Only“ ”
1 Previous noun 0.397 0.399 0.391
2 EM, no initializer 0.610 0.632 0.549
3 Initializer, no EM 0.628 0.642 0.587
4 EM w/ initializer 0.632 0.663 0.546
5 Maxent extension 0.669 0.696 0.593
6 Upper bound 0.838 0.868 0.754
influence of all other models (Table 3). This, com-
bined with the success of the initializer alone, leads
us to believe that a strong notion of gender/number
is very important in this task. Therefore, we im-
plemented EM with several models that used only
pronoun category, but none were able to surpass the
initializer in accuracy on the test key. One factor
that might help explain the initializer’s success is
that despite using only a PrU(p|l) model, the ini-
tializer also has an implicit factor resembling a Pr(l)
model: when two candidates agree with the category
of the pronoun, add-1 smoothing ensures the more
frequent candidate receives a higher probability.
As was stated in Section 3.4, sentences with quo-
tations were excluded from the learning process be-
cause the presence of a correct antecedent in the can-
didate list was less frequent in these cases. This is
validated by the low upper bound of 0.754 in the
only-quote portion of the test key. We can see that
all methods except for the previous noun heuris-
tic score noticeably better when ignoring those sen-
tences that contain quotation marks. In particular,
the difference between our three unsupervised solu-
tions ((2), (3) and (4)) are more pronounced. Much
of the performance improvements that correspond
to our model refinements are masked in the overall
task because adding the initializer to EM does not
improve EM’s performance on quotes at all. Devel-
oping a method to construct more robust candidate
lists for quotations could improve our performance
on these cases, and greatly increase the percentage
of pronouns we are training on for a given corpus.
Table 3: Weights set by maximum entropy.
Model Pr(p|l) Pr(k|l) Pr(l) Pr(j)
Lambda 0.931 0.056 0.070 0.167
Table 4: Comparison to SVM.
Method Accuracy
Previous noun 0.398
EM w/ initializer 0.664
Maxent extension 0.708
SVM 0.714
5.2 Comparison to supervised system
We put our results in context by comparing our
methods to a recent supervised system. The compar-
ison system is an SVM that uses 52 linguistically-
motivated features, including probabilistic gen-
der/number information obtained through web
queries (Bergsma, 2005a). The SVM is trained
with 1398 separate labeled pronouns, the same train-
ing set used in (Bergsma, 2005a). This data is
also drawn from the news domain. Note the su-
pervised system was not constructed to handle all
pronoun cases, so non-anaphoric pronouns were re-
moved from the test key and from the candidate lists
in the test key to ensure a fair comparison. As ex-
pected, this removal of difficult cases increases the
performance of our system on the test key (Table 4).
Also note there is no significant difference in per-
formance between our supervised extension and the
SVM. The completely unsupervised EM system per-
forms worse, but with only a 7% relative reduction
in performace compared to the SVM; the previous
noun heuristic shows a 44% reduction.
5.3 Analysis of upper bound
If one accounts for the upper bound in Table 2, our
methods do very well on those cases where a cor-
rect answer actually appears in the candidate list: the
best EM solution scores 0.754, and the supervised
extension scores 0.800. A variety of factors result in
the 196 candidate lists that do not contain a true an-
tecedent. 21% of these errors arise from our limited
candidate window (Section 3.1). Incorrect pleonas-
tic detection accounts for another 31% while non-
94
noun referential pronouns cause 25% (Section 3.3).
Linguistic filters (Section 3.4) account for most of
the remainder. An improvement in any of these com-
ponents would result in not only higher final scores,
but cleaner EM training data.
6 Conclusion
We have demonstrated that unsupervised learning is
possible for pronoun resolution. We achieve accu-
racy of 63% on an all-pronoun task, or 75% when
a true antecedent is available to EM. There is now
motivation to develop cleaner candidate lists and
stronger probability models, with the hope of sur-
passing supervised techniques. For example, incor-
porating antecedent context, either at the sentence
or document level, may boost performance. Further-
more, the lexicalized models learned in our system,
especially the pronoun model, are potentially pow-
erful features for any supervised pronoun resolution
system.

References
David L. Bean and Ellen Riloff. 2004. Unsupervised learning
of contextual role knowledge for coreference resolution. In
HLT-NAACL, pages 297–304.
Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della
Pietra. 1996. A maximum entropy approach to natural lan-
guage processing. Computational Linguistics, 22(1):39–71.
Shane Bergsma. 2005a. Automatic acquisition of gender infor-
mation for anaphora resolution. In Proceedings of the 18th
Conference of the Canadian Society for Computational Intel-
ligence (Canadian AI 2005), pages 342–353, Victoria, BC.
Shane Bergsma. 2005b. Corpus-based learning for pronom-
inal anaphora resolution. Master’s thesis, Department
of Computing Science, University of Alberta, Edmonton.
http://www.cs.ualberta.ca/˜bergsma/Pubs/thesis.pdf.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra,
and Robert L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computational
Linguistics, 19(2):263–312.
Claire Cardie and Kiri Wagstaff. 1999. Noun phrase corefer-
ence as clustering. In Proceedings of the 1999 Joint SIGDAT
Conference on Empirical Methods in Natural Language Pro-
cessing and Very Large Corpora, pages 82–89.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maxi-
mum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, 39(1):1–38.
Niyu Ge, John Hale, and Eugene Charniak. 1998. A statistical
approach to anaphora resolution. In Proceedings of the Sixth
Workshop on Very Large Corpora, pages 161–171.
L. Haegeman. 1994. Introduction to Government & Binding
theory: Second Edition. Basil Blackwell, Cambridge, UK.
Donald Hindle and Mats Rooth. 1993. Structural ambiguity
and lexical relations. Computational Linguistics, 19(1):103–
120.
Harold Jeffreys, 1961. Theory of Probability, chapter 3.23. Ox-
ford: Clarendon Press, 3rd edition.
Andrew Kehler. 1997. Probabilistic coreference in informa-
tion extraction. In Proceedings of the Second Conference on
Empirical Methods in Natural Language Processing, pages
163–173.
Christopher Kennedy and Branimir Boguraev. 1996. Anaphora
for everyone: Pronominal anaphora resolution without a
parser. In Proceedings of the 16th Conference on Compu-
tational Linguistics, pages 113–118.
Shalom Lappin and Herbert J. Leass. 1994. An algorithm for
pronominal anaphora resolution. Computational Linguis-
tics, 20(4):535–561.
Dekang Lin. 1994. Principar - an efficient, broad-coverage,
principle-based parser. In Proceedings of COLING-94,
pages 42–48, Kyoto, Japan.
Robert Malouf. 2002. A comparison of algorithms for max-
imum entropy parameter estimation. In Proceedings of the
Sixth Conference on Natural Language Learning (CoNLL-
2002), pages 49–55.
Ruslan Mitkov, Richard Evans, and Constantin Orasan. 2002.
A new, fully automatic version of Mitkov’s knowledge-poor
pronoun resolution method. In Proceedings of the Third
International Conference on Computational Linguistics and
Intelligent Text Processing, pages 168–186.
Christoph M¨uller, Stefan Rapp, and Michael Strube. 2002. Ap-
plying co-training to reference resolution. In Proceedings
of the 40th Annual Meeting of the Association for Computa-
tional Linguistics, pages 352–359.
Vincent Ng and Claire Cardie. 2002. Improving machine learn-
ing approaches to coreference resolution. In Proceedings of
the 40th Annual Meeting of the Association for Computa-
tional Linguistics, pages 104–111.
Vincent Ng and Claire Cardie. 2003. Weakly supervised nat-
ural language learning without redundant views. In HLT-
NAACL 2003: Proceedings of the Main Conference, pages
173–180.
Franz J. Och and Hermann Ney. 2002. Discriminative training
and maximum entropy models for statistical machine trans-
lation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, pages 295–302,
Philadelphia, PA, July.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim.
2001. A machine learning approach to coreference resolu-
tion of noun phrases. Computational Linguistics, 27(4):521–
544.
Ellen Vorhees. 2002. Overview of the TREC 2002 question an-
swering track. In Proceedings of the Eleventh Text REtrieval
Conference (TREC).
