Probabilistic Coreference in Information Extraction
Andrew Kehler
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025
kehler@ai.sri.com
Abstract 
Certain applications require that the out- 
put of an information extraction system be 
probabilistic, so that a downstream system
can reliably fuse the output with possibly
contradictory information from other
sources. In this paper we consider the 
problem of assigning a probability distri- 
bution to alternative sets of coreference re- 
lationships among entity descriptions. We 
present the results of initial experiments 
with several approaches to estimating such 
distributions in an application using SRI's 
FASTUS information extraction system. 
1 Introduction 
Natural language information extraction (IE) sys- 
tems take texts containing natural language as input 
and produce database templates populated with in- 
formation that is relevant to a particular application. 
These records may be fed as input to a downstream 
system for which the IE system is only one of sev- 
eral sources of information. In such a scenario, the 
downstream system must fuse the incoming information
from each of its sources, requiring the resolution
of conflicts. To accomplish this, the fusion system 
must know the reliability of the information received 
from each source; in this way unreliable information 
from one source can be disregarded in favor of highly 
reliable information from another. 
Figure 1 exhibits this scenario with a typical IE 
system such as SRI's FASTUS system (Hobbs et al., 
1996). The IE system has two components. The first 
component consists of a series of phases that recog- 
nize domain-relevant patterns in the text and create 
templates representing event and entity descriptions 
from them. The second component merges tem- 
plates created from different phrases in the text that 
overlap in reference. The resulting set of templates 
constitutes a formal description of the state of af- 
fairs as described in the text with respect to the 
application specification, which is then fed to the 
downstream system. 
As part of determining this state of affairs, the 
IE system must create templates describing the rel- 
evant entities that are reported on. This requires 
determining when two or more templates describe 
the same entity, as templates created from corefer- 
ring phrases need to be merged. We have performed 
an informal study of FASTUS's processing of a set 
of texts which indicates that the merging phase is 
where most of the ambiguities (as well as most of 
the errors) lie. However, most IE systems, including 
FASTUS, have pursued a deterministic strategy for 
merging and report only a single possible state of 
affairs. This limitation makes it difficult for a down- 
stream system to fuse the information with possibly 
contradictory information from other sources, as no 
information about the IE system's certainty of the 
results is passed along, nor is information about pos- 
sible alternative states of affairs and their associated 
levels of certainty. 
In this paper, we consider the problem of assign- 
ing a probability distribution to alternative sets of 
coreference relationships among entity descriptions. 
We present the results of initial experiments with 
several approaches to estimating such distributions 
in an application using FASTUS. 
2 Overview of the Problem 
Let us consider an example text of the sort that we 
encounter in our application: 1
    Subj: Kinston Military Rail Depot
    A rail depot was found 100 km southwest of the capitol of Raleigh,
    consisting of extensive admin and support areas (similar to the
    ammunition depot in Fairview), two material storage areas, extensive
    transshipment facilities (some of which are under construction
    immediately east of the depot), and several training areas.
1 The texts in our application are messages consisting
of free text, possibly interspersed with formatted tables
or charts which themselves may contain natural language
fragments that require analysis. While this example is
shorter than most texts in our corpus, the relevant free
text portions of the messages are typically no longer than
a few paragraphs. The style displayed in this example is
fairly typical, although in some cases the sentence
structure is more telegraphic.
[Figure: NL text enters the information extraction system (pattern recognition, then template merging); its output, along with that of other information sources, feeds downstream processing.]
Figure 1: A Scenario Employing an Information Extraction System
We focus on the four mentions of depots in the
text: the depot named in the subject line, "a rail depot",
"the ammunition depot in Fairview", and "the depot". The
pattern matching phases of FASTUS produce templates
similar to those shown in Figure 2.
Slot      Template A  Template B  Template C  Template D
FACILITY  DEPOT       DEPOT       DEPOT       DEPOT
NUMBER    1           1           1           1
LOCATION  KINSTON     -           FAIRVIEW    -
TYPE      RAIL        RAIL        AMMUNITION  -
Figure 2: Templates Representing Depots Mentioned
We will refer to a set of templates that have potential
coreference relationships among them as a
coreference set, 2 and possible partitions of coreferential
templates in the set as coreference configurations.
In the coreference set containing templates
A, B, C, and D, system knowledge external to the 
probabilistic model indicates that the type Ammuni- 
tion in template C is not compatible with the type 
Rail in A and B; therefore these are taken a pri- 
ori to be non-coreferential. Given these incompati- 
bilities, seven possible coreference configurations re- 
main. Template names grouped within parentheses 
are taken to be mutually coreferring; we will refer to 
such a grouping as a cell of the coreference configu- 
ration. 
1. (A B D) (C)
2. (A B) (C D)
3. (A B) (C) (D)
4. (A D) (B) (C)
5. (B D) (A) (C)
6. (C D) (A) (B)
7. (A) (B) (C) (D)
The first of these configurations expresses the correct 
coreference relationships for the example. 
Given a coreference set of templates, possibly cou- 
pled with a list of template pairs known a priori not 
to corefer, the task is to assign a probability distri- 
bution over the possible coreference configurations 
for that set. 
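The enumeration of coreference configurations can be illustrated with a short Python sketch. It assumes that templates are identified by name and that a priori incompatibilities are supplied as unordered pairs; the function names are illustrative and are not part of FASTUS.

```python
def partitions(items):
    """Yield every partition of a list as a list of cells (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing cell in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or give it a cell of its own.
        yield [[first]] + part


def configurations(templates, incompatible):
    """Partitions in which no cell holds an a priori incompatible pair."""
    banned = {frozenset(pair) for pair in incompatible}
    result = []
    for part in partitions(list(templates)):
        ok = all(
            frozenset((a, b)) not in banned
            for cell in part
            for i, a in enumerate(cell)
            for b in cell[i + 1:]
        )
        if ok:
            result.append(part)
    return result


# Example from the text: C is incompatible with both A and B.
configs = configurations("ABCD", [("A", "C"), ("B", "C")])
for cells in configs:
    print(" ".join("(" + " ".join(cell) + ")" for cell in cells))
print(len(configs))  # 7
```

Of the 15 partitions of four templates, only the seven listed above survive the two incompatibility constraints.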
2 Templates A, B, C, and D constitute the only coreference
set in this example, since none of the other NPs
(e.g., the various "areas" mentioned) are compatible
with any of the others. In general, however, a text can
give rise to any number of distinct coreference sets, each
of which will be assigned its own probability distribution.
Relationship to Past Work While there have
been previous investigations of empirical approaches
to coreference, these have generally centered on the
task of assigning correct referents for anaphoric
expressions (Connolly, Burger, and Day, 1994; Aone
and Bennett, 1995; Lappin and Leass, 1994; Dagan
and Itai, 1990; Dagan et al., 1995; Kennedy and
Boguraev, 1996a; Kennedy and Boguraev, 1996b).
The current task deviates from that problem in sev- 
eral respects. First, in our task, all coreference re- 
lationships among templates are modeled regardless 
of the "referentiality" of the phrases that led to their 
creation. For instance, indefinites will sometimes 
corefer with a previously described entity; a typ- 
ical case is illustrated by the coreference between 
the indefinite "a rail depot" and the depot intro- 
duced in the subject line in the example passage. 
Also, entities described with bare plurals are com- 
monly found to be coreferential with other entities, 
in addition to cases in which they have their more 
standard generic meanings. On the other hand, def- 
inite noun phrases are often not referential to items 
evoked in the text (e.g., "the ammunition depot in 
Fairview"). Determining when such expressions are 
discourse-anaphoric is part of the task; this informa- 
tion is generally not known to the system a priori. 
Second, the results of this task will be evaluated 
by the probability assigned to the correct state of 
affairs with respect to an entire coreference set, and 
not by the number of correct antecedents assigned 
to anaphoric expressions. Modeling at the level of 
coreference sets ensures that the probabilities are 
consistent when considering the global state of af- 
fairs being described in the text. Furthermore, the 
role of probabilities for this application goes beyond 
selecting the correct coreference relationships - the 
probability assigned to an alternative will be cen- 
tral in determining how the downstream system will 
weigh it against information from other sources dur- 
ing data fusion. A system that assigns a probability 
of 0.9 to correct answers is more successful than one 
that assigns a probability of 0.6 to them. 
The Limitations of IE Systems The properties 
of typical IE systems such as FASTUS also make this 
task challenging. For one, successful modeling of 
coreference relationships is hampered by the crude- 
ness of the representations used. The templates that 
are created are fairly shallow and may be incom- 
plete. A reliance on detailed information about the 
context can prove detrimental if such information 
is often missed by the system. Also, FASTUS
does not build up complex representations for the 
syntax and semantics of sentences, placing limits on 
the extent to which such information can be utilized 
in determining coreference. Lastly, there are the in- 
accuracies that result from processing real text. The 
pattern matching phases of FASTUS may intermit- 
tently misanalyze phrases that serve as antecedents 
for subsequent referring expressions. Therefore, for 
example, with respect to an identified coreference 
set, it may be correct to place a referential pronoun 
in its own cell (implying that it does not corefer with 
anything), simply because system error caused its 
antecedent not to be included in the set. 
Outline of the Approach The number of coref- 
erence configurations over which a distribution is to 
be assigned depends on the number of templates in 
the coreference set, and the set of a priori constraints 
against coreference between some of its members. As 
there are many scenarios that will never be encoun- 
tered in a corpus of training data of any reasonable 
size, it would be hopeless to attempt to estimate a 
conditional distribution for each possibility directly. 
To make matters worse, training data comes at a 
cost, as keys have to be coded by hand. One of the 
goals of this effort is to make it possible to train
probabilities in new domains quickly, which requires 
an approach that is successful with a limited amount 
of training data. 
However, it would be reasonable to expect that 
we have enough data to estimate distributions for 
coreference sets with only two members. This sug- 
gests a two-step approach. First, we develop a gen- 
eral model of coreference between any two templates, 
and apply it to pairwise combinations of templates 
in a given coreference set without regard to the other 
templates in the set. We then utilize a method for 
combining the resulting probabilities to form a dis- 
tribution over all the possible coreference configura- 
tions. We describe our method for modeling prob- 
abilities between pairs of templates in the next sec- 
tion, and describe two methods for deriving a dis- 
tribution over the coreference configurations in Sec- 
tion 4. We report on an evaluation and comparison 
of the approaches in Section 5. 
3 Training A Model for Pairs of 
Templates 
Our first task is to derive a model for determining 
the probability that two templates corefer, condi- 
tioned on various characteristics of the context. For 
this we employ an approach to maximum entropy 
modeling described by Berger et al. (1996). 
Maximum Entropy Modeling Suppose we wish 
to model some random process, such as that which 
determines coreference between two templates gen- 
erated by an IE system, based on various character- 
istics of the context that influence this process, such 
as the content of the templates themselves, the form 
of the natural language expressions from which the 
templates were created, and the distance between 
those expressions in the text. We refer to the col- 
lection of such characteristics for a given example as 
its context x, and the value denoting the output of 
the process as y. We can define a set of binary fea- 
tures that relate a possible value of a characteristic 
of x with a possible outcome y, i.e., whether the two 
templates corefer (y = 1) or not (y = 0). For example,
a feature $f_1(x, y)$ pairing the characteristic of S
and T having identical slot values with the outcome
that they corefer would be defined as follows.
Binary Feature $f_1(x, y)$:
$$f_1(x, y) = \begin{cases} 1 & \text{if S and T have identical slot values and S and T corefer} \\ 0 & \text{otherwise} \end{cases}$$
From these features we can define constraints on 
the probabilistic model that is learned, in which we 
assume that the expected value of the feature with 
respect to the distribution of the training data ($p_d$)
holds with respect to the general model ($p_m$).
Constraints:
$$\sum_{x,y} p_d(x, y)\, f(x, y) = \sum_{x,y} p_d(x)\, p_m(y \mid x)\, f(x, y)$$
Given that we have chosen a set of such constraints 
to impose on our model, we wish to identify that 
model which has the maximum entropy - this is 
the model that assumes the least information be- 
yond those constraints. Berger et al. (1996) show 
that this model is a member of an exponential family 
with one parameter for each constraint, specifically 
a model of the form 
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
in which
$$Z(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
The parameters $\lambda_1, \ldots, \lambda_n$ are Lagrange multipliers
that impose the constraints corresponding to the
chosen features $f_1, \ldots, f_n$. The term $Z(x)$ normalizes
the probabilities by summing over all possible
outcomes y. Berger et al. (1996) demonstrate that
the optimal values for the $\lambda_i$'s can be obtained by
maximizing the likelihood of the training data with 
respect to the model, which can be performed using 
their improved iterative scaling algorithm. 
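The exponential form above can be made concrete with a small Python sketch. The two features and their weights are invented for illustration only; they are not the features or parameters learned in these experiments, and no training procedure (such as improved iterative scaling) is shown.

```python
import math

# Hypothetical binary features f_i(x, y). Here x is a dict of context
# characteristics and y is 1 (corefer) or 0 (do not corefer).
def f_identical(x, y):
    return 1 if x["identical_slots"] and y == 1 else 0

def f_far_apart(x, y):
    return 1 if x["distance"] == "very far" and y == 0 else 0

FEATURES = [f_identical, f_far_apart]
WEIGHTS = [1.2, 0.8]  # illustrative lambda_i values, not trained ones

def p_given_context(x, outcomes=(0, 1)):
    """Return p(y | x) for each outcome under the exponential model."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(WEIGHTS, FEATURES)))
        for y in outcomes
    }
    z = sum(scores.values())  # Z(x): normalizes over all outcomes y
    return {y: s / z for y, s in scores.items()}

x = {"identical_slots": True, "distance": "close"}
dist = p_given_context(x)
print(dist[1])  # probability that the two templates corefer
```

Only the first feature fires for this context, so the result reduces to exp(1.2) / (1 + exp(1.2)).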
In practice, we will not want to incorporate con- 
straints for all of the features that we might define, 
but only those that are most relevant and informa- 
tive. Therefore, we use a procedure for selecting 
which of our pool of features should be made active. 
At each iteration, the algorithm approximates the 
gain in the model's predictiveness that would result 
from imposing the constraints corresponding to each 
of the existing inactive features, and selects the one 
with the highest anticipated payoff. Upon making 
this feature active, the $\lambda_i$'s for all active features are
(re)trained so that the constraints are all met simul- 
taneously. The feature selection process is iterated 
until the approximate gain for all the remaining in- 
active features is negligible. 
Characteristics of Context for Template 
Coreference We now need a set of possible char- 
acteristics of context on which the algorithm could 
choose to conditionalize in deriving the probabilis- 
tic model. For our initial experiments, we uti- 
lized a set of easily computable, but fairly crude, 
characteristics. 3 These characteristics fall into three 
categories. In what follows, we take S and T to be 
arbitrary templates where the natural language ex- 
pression from which T was created appears later in 
the text than the expression from which S was cre- 
ated. 
The first category relates to the contents of the 
templates themselves. We model the relationship 
between S and T as one of the following: S and T 
have identical slot values, S is properly subsumed by 
T, S properly subsumes T, or S and T are otherwise 
consistent. For instance, in our example in Section 2, 
template A is properly subsumed by template B, and 
A, B, and C are all properly subsumed by D, since 
in each case the latter template is more general than 
the former. We also have a binary characteristic for 
S and T having at least two (non-nil) slot values 
in common. Finally, we have a characteristic for 
modeling when the values of the NAME slots of the two
templates are both multi-worded and identical; this
is a crude heuristic for identifying matching unique 
identifiers. 
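The content-based characteristics can be sketched as follows. The representation of a template as a flat dictionary of non-nil slot values is a simplifying assumption, as is the final "incompatible" catch-all, which is not one of the four relations named above; the function names are hypothetical.

```python
def relation(s, t):
    """Classify the content relationship between templates S and T."""
    s_items, t_items = set(s.items()), set(t.items())
    if s_items == t_items:
        return "identical"
    if t_items < s_items:   # T says strictly less, so T is more general
        return "S properly subsumed by T"
    if s_items < t_items:
        return "S properly subsumes T"
    # Slots filled in both templates must agree to be consistent.
    if all(s[k] == t[k] for k in s.keys() & t.keys()):
        return "otherwise consistent"
    return "incompatible"   # assumed catch-all, outside the four relations


def shares_two_values(s, t):
    """True if S and T have at least two slot values in common."""
    return len(set(s.items()) & set(t.items())) >= 2


# Templates from Figure 2.
A = {"FACILITY": "DEPOT", "NUMBER": "1", "LOCATION": "KINSTON", "TYPE": "RAIL"}
B = {"FACILITY": "DEPOT", "NUMBER": "1", "TYPE": "RAIL"}
C = {"FACILITY": "DEPOT", "NUMBER": "1", "LOCATION": "FAIRVIEW",
     "TYPE": "AMMUNITION"}
D = {"FACILITY": "DEPOT", "NUMBER": "1"}

print(relation(A, B))           # S properly subsumed by T
print(relation(C, D))           # S properly subsumed by T
print(shares_two_values(A, D))  # True
```

In the system described here, incompatibility is decided by knowledge external to the probabilistic model, so the simple value comparison in this sketch stands in for that step.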
3 One could imagine a variety of more detailed and
informative characteristics of context than those used
here. However, in performing these experiments, we are
interested in how far we can get with a fairly simple
strategy that will port relatively easily to new domains,
rather than relying heavily on information that is specific
to our current domain. A fairly coarse-grained set of
characteristics also allows us to restrict ourselves to a
relatively small set of training data; likewise we will not
want to encode a large set of data for each new domain.
Template S  Template T  Probability
A           B           0.671
A           D           0.505
B           D           0.752
C           D           0.504
Table 1: Pairwise Probabilities for Example Coreference Set
The second category of characteristics relates to
the form of reference used in the expression from
which T was created, specifically whether it was
described with an indefinite phrase, a definite phrase
(including pronouns), or neither of these (e.g., a 
bare, non-pronominal noun phrase). In the case of 
definite expressions, we also consider the recommen- 
dations of a distinct coreference module within FAS- 
TUS. We have a characteristic representing whether 
the potential antecedent is the preferred antecedent, 4 
a non-preferred, but possible antecedent, or not on 
the list of possible antecedents. 5 
The final category of characteristics relates to the 
distance in the text between the expressions from 
which S and T were created, which we categorize 
as being in one of five equivalence classes: very 
close, close, mid-distance, far away, and very far 
away. These distances are measured crudely (i.e., 
by character length) so as not to be dependent on 
the accuracy of methods for identifying more com- 
plex boundaries (e.g., clause, sentence, and discourse 
segment boundaries). 
The results of training the maximum entropy 
models are discussed in Section 5. To illustrate the 
approaches described in the next section, we will use 
the probabilities for the templates from the example 
passage in Section 2, shown in Table 1, which were 
produced from the parameters induced from one of 
the training sets. 
4 Preferred reference is a transitive relation; that is,
template S is treated as a preferred referent of template
T if there is a chain of preferred referents linking them,
e.g., if there is a template R that is the preferred referent
of T and template S is the preferred referent of R.
5 Although we do not model information about the
surface positions of the expressions from which S and T
were created within their respective sentences, the
coreference module does take such information into account
in determining likely antecedents of definite expressions.
4 Inferring a Model for Coreference Sets
We now have a method for obtaining a model that
assigns probabilities to the pairs of templates
(henceforth, "pairwise probabilities") in a coreference set
that can possibly corefer. If there are only two
templates in the coreference set, then we have the
distribution we seek. However, if there are more than two
templates, we must utilize the pairwise probabili- 
ties to derive a distribution over the members of the 
set of coreference configurations. In the following 
sections, we describe two approaches to recovering 
such a distribution, followed by a description of two 
baseline metrics. An evaluation of these approaches 
is then given in Section 5. 
4.1 An Evidential Reasoning Approach 
The first approach we describe uses the pairwise 
probabilities as sources of evidence that inform the 
choice of model for the coreference sets. The list of 
coreference configurations for our example passage 
is repeated below; we will refer to these configurations
by their corresponding numbers.
1. (A B D) (C)
2. (A B) (C D)
3. (A B) (C) (D)
4. (A D) (B) (C)
5. (B D) (A) (C)
6. (C D) (A) (B)
7. (A) (B) (C) (D)
We recast a probability that two templates S and 
T corefer as a mass distribution over two members of 
the power set of coreference configurations, namely 
the set containing exactly those configurations in 
which S and T occupy the same cell, and the set 
containing those in which they do not. For instance, 
the probability that A and B corefer was determined 
to be 0.671; mapping this to corresponding sets of 
coreference configurations results in the mass
distribution $m_{AB}$ in which
$$m_{AB}(\{\text{Configs } 1, 2, 3\}) = 0.671$$
and
$$m_{AB}(\{\text{Configs } 4, 5, 6, 7\}) = 0.329$$
This mass distribution can be seen as representing 
the beliefs of an observer who only has access to 
templates A and B, and who is therefore ignorant 
about their relationship to C and D. We can view 
the other pairwise probabilities for the coreference 
set in the same manner. 
In the best of all worlds, we might identify a model 
that is consistent with the mass distributions pro- 
vided by all the pairwise probabilities. However, 
such a model may not, and often will not, exist. 
This is the case for the pairwise probabilities in our 
example, which can be seen most easily by consider- 
ing only templates A, C, and D. The probability of 
A and D coreferring is 0.505 and of C and D corefer- 
ring is 0.504. Because we know that A and C can- 
not corefer, the coreference configurations in which 
A and D corefer and the configurations in which C 
and D corefer are mutually exclusive. Therefore, 
there would have to be a distribution that assigns 
0.505 of probability mass to a set of configurations 
that is mutually exclusive from a set that is assigned 
0.504 of probability mass. Obviously, this cannot be 
done with a set of probabilities that add up to 1. 
This inconsistency arises from the manner in 
which the pairwise probabilities are estimated. The 
probability of coreference between templates situated
similarly to A and D may be 0.505 with respect
to all contexts in the training data; however,
it is almost certainly not this high with respect to
the subset of cases in which a template similar to C 
is similarly situated. The same reasoning applies to 
the probability of C and D coreferring in light of the 
existence of A. Unfortunately, the existence of tem- 
plates other than the pair being modeled is the type 
of conditional information for which we have little 
hope of accounting in a general and statistically sig- 
nificant manner. 
Therefore, we may be left with a series of mass 
distributions defined over sets of coreference configu- 
rations that are in inherent conflict. Instead of view- 
ing these distributions as constraints on the under- 
lying probabilistic model, we view them as sources 
of evidence. The question is then how to take these 
sources into account, given that they may be partially
contradictory. Dempster's Rule of Combination
(Dempster, 1968) provides a mechanism for doing
this. Dempster's rule combines two mass distributions
$m_1$ and $m_2$ to form a third distribution $m_3$
that represents the consensus of the original two
distributions; the new mass distribution in effect leans
toward the areas of agreement between the original
distributions and away from points of conflict.
Dempster's rule is defined as follows:
$$m_3(A_k) = \frac{1}{1 - \kappa} \sum_{A_i \cap A_j = A_k} m_1(A_i)\, m_2(A_j)$$
in which
$$\kappa = \sum_{A_i \cap A_j = \emptyset} m_1(A_i)\, m_2(A_j)$$
The $A_i$ in our case are members of the power set of
possible coreference configurations. In our example
above, $m_{AB}$ assigns probability mass to two such
$A_i$, the set containing configurations 1, 2, and 3,
and the set containing configurations 4, 5, 6, and 7.
The value $\kappa$ is called the conflict between the mass
distributions being combined; it provides a measure
of the degree of disagreement between them. When
$\kappa = 0$, the original distributions are compatible;
when $\kappa = 1$, they are in complete conflict and the
result is undefined. When $0 < \kappa < 1$, some conflict
between the distributions exists; Dempster's rule has 
the effect of focusing on the agreement between the 
distributions by eliminating the conflicting portions 
and normalizing what remains. 
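Dempster's rule as defined above can be sketched in a few lines of Python. The representation is an assumption made for illustration: a mass distribution is a dictionary from frozensets of configuration numbers to mass, and the helper names are hypothetical.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass distributions.

    Returns the combined distribution and the conflict between the inputs.
    """
    raw, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            raw[common] = raw.get(common, 0.0) + pa * pb
        else:
            conflict += pa * pb   # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("complete conflict: combination undefined")
    return {k: v / (1.0 - conflict) for k, v in raw.items()}, conflict


def pair_mass(p, same_cell, all_configs):
    """Mass distribution induced by one pairwise probability p."""
    same = frozenset(same_cell)
    return {same: p, frozenset(all_configs) - same: 1.0 - p}


ALL = range(1, 8)                           # configurations 1 through 7
m_ab = pair_mass(0.671, {1, 2, 3}, ALL)     # A and B share a cell
m_ad = pair_mass(0.505, {1, 4}, ALL)        # A and D share a cell
m_bd = pair_mass(0.752, {1, 5}, ALL)        # B and D share a cell
m_cd = pair_mass(0.504, {2, 6}, ALL)        # C and D share a cell

# The two sources discussed above conflict on the mass 0.505 * 0.504.
_, kappa = combine(m_ad, m_cd)
print(kappa)

# Iteratively combining all four pairwise sources.
combined = m_ab
for m in (m_ad, m_bd, m_cd):
    combined, _ = combine(combined, m)
print(round(combined[frozenset({1})], 3))   # 0.383
```

After all four sources are combined, every remaining focal element is a single configuration, so the result can be read directly as a distribution over configurations.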
We can therefore use Dempster's Rule to resolve 
the conflict between the pairwise probability distri- 
butions to generate a distribution over the coref- 
erence configurations. Because we have pairwise 
probabilities for each possibly coreferring pair in the 
coreference set, it turns out that the Dempster solu- 
tion is more easily stated and computed here than 
in the general case. The solution is identical to the 
one that results when the probabilities of all the rele- 
vant pairwise relations (indicating either coreference 
or not) are multiplied, normalized by the amount of 
probability mass assigned to coreference configura- 
tions that are impossible because coreference is tran- 
sitive. For instance, the probability for the corefer- 
ence configuration ((A B) (C)) is initially computed 
to be 6 
$$p(A =_c B) \cdot p(A \neq_c C) \cdot p(B \neq_c C)$$
However, using this method, impossible combinations
(e.g., $A =_c B$, $B =_c C$, $A \neq_c C$) will also
receive positive probability mass. If we normalize the
probabilities of possible combinations by distribut- 
ing the sum of the probability assigned to all im- 
possible combinations, the result is the same as that 
gotten by iteratively combining the pairwise distri- 
butions using Dempster's Rule. 
The resulting distribution for our example is: 
1. (A B D) (C) = .383 
2. (A B) (C D) = .184 
3. (A B) (C) (D) = .123 
4. (A D) (B) (C) = .062
5. (B D) (A) (C) = .125 
6. (C D) (A) (B) = .061 
7. (A) (B) (C) (D) = .061 
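The multiply-and-normalize shortcut described above can be sketched as follows, using the Table 1 probabilities and the seven possible configurations. This is an illustration under those stated inputs rather than the system's implementation, and the names are hypothetical.

```python
# Pairwise coreference probabilities from Table 1. Pairs that are a priori
# incompatible (A-C and B-C) never corefer and contribute a factor of 1.
PAIRWISE = {("A", "B"): 0.671, ("A", "D"): 0.505,
            ("B", "D"): 0.752, ("C", "D"): 0.504}

# The seven possible configurations, numbered as in the text.
CONFIGS = {
    1: [{"A", "B", "D"}, {"C"}],
    2: [{"A", "B"}, {"C", "D"}],
    3: [{"A", "B"}, {"C"}, {"D"}],
    4: [{"A", "D"}, {"B"}, {"C"}],
    5: [{"B", "D"}, {"A"}, {"C"}],
    6: [{"C", "D"}, {"A"}, {"B"}],
    7: [{"A"}, {"B"}, {"C"}, {"D"}],
}

def same_cell(config, a, b):
    return any(a in cell and b in cell for cell in config)

def product_score(config):
    """Multiply p or (1 - p) for every pair that could corefer."""
    score = 1.0
    for (a, b), p in PAIRWISE.items():
        score *= p if same_cell(config, a, b) else 1.0 - p
    return score

def distribution(configs):
    scores = {n: product_score(c) for n, c in configs.items()}
    total = sum(scores.values())  # mass left once impossible cases are dropped
    return {n: s / total for n, s in scores.items()}

dist = distribution(CONFIGS)
print(round(dist[1], 3))  # 0.383
```

Under these inputs configuration 1 comes out at about .383, as in the listing above. The sketch is only illustrative and is not guaranteed to reproduce every listed figure, since the original computation is not spelled out in full.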
In motivating our approach, we noted that we can- 
not expect to have the amount of training data nec- 
essary to directly estimate distributions for all the 
possible scenarios with which we may be confronted. 
Limiting ourselves to modeling probabilities between 
pairs of templates, however, leads to inconsistencies 
because of the failure to take into account the crucial 
information provided by the existence of other com- 
patible templates. Dempster's Rule can be seen as a 
very coarse-grained approach to conditioning on con- 
text in this regard. The contributions of the pairwise 
models are conditioned not on the existence of other 
templates in context, but by virtue of the existence
of conflicting models derived from those templates.
For instance, the pairwise probability of coreference
between C and D was originally 0.504, which might
be reasonable if those were the only two templates
generated from the text. 7 However, the probability
that C and D corefer in the final distribution is only
0.245, the sum of the probabilities of the two partitions
in which C and D occupy the same cell. This
adjustment results from the existence of templates A
and B: the fact that template D has a high probability
of coreferring with each, combined with the fact
that template C is incompatible with each, reduces
the likelihood that C and D corefer. Therefore, the
preferences for particular coreferential dependencies
can change when considering the larger picture of
possible coreference sets.
6 We use the notation =c to indicate coreference.
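The 0.245 figure can be checked against the distribution listed earlier in this section, in which configurations 2 and 6 are the only ones that place C and D in the same cell:

```latex
p(C =_c D) = p(\text{configuration } 2) + p(\text{configuration } 6)
           = 0.184 + 0.061 = 0.245
```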
In practice, coreference sets that are significantly 
larger than the one we have considered here can lead 
to an explosive number of possible coreference con- 
figurations. We have implemented simple methods 
for pruning very low probability configurations dur- 
ing processing and for smoothing the resulting distri- 
bution. The latter step is accomplished, when nec- 
essary, by eliminating certain low-probability config- 
urations at the end of processing. The probability 
mass from these configurations is distributed uni- 
formly over all the possible configurations that have 
been eliminated. While this is unlikely to be the 
best strategy for smoothing from the standpoint of 
probabilistic modeling, we are constrained by the 
number of alternatives we can report to the down- 
stream system. Smoothing in this way allows us to 
report only the coreference configurations with non- 
negligible probability, along with a single probability 
that is assigned uniformly to the remainder of the 
possible configurations. 
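The reporting scheme described above might look roughly like the following sketch. How the eliminated configurations are chosen is not specified in the text, so the fixed cutoff `keep` is purely an assumption, as are the function and variable names.

```python
def smooth(dist, keep):
    """Keep the `keep` most probable configurations and spread the
    remaining mass uniformly over the eliminated ones.

    Returns the kept configurations with their probabilities, the single
    probability shared by each eliminated configuration, and the number
    of eliminated configurations.
    """
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, dropped = ranked[:keep], ranked[keep:]
    shared = sum(p for _, p in dropped) / len(dropped) if dropped else 0.0
    return dict(kept), shared, len(dropped)


# Distribution listed above for the evidential approach.
listed = {1: .383, 2: .184, 3: .123, 4: .062, 5: .125, 6: .061, 7: .061}
kept, shared, n_dropped = smooth(listed, keep=3)
print(kept, shared, n_dropped)
```

The total mass is unchanged: the kept probabilities plus `shared` times the number of eliminated configurations equals the original sum.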
7 Actually this number is lower than it would have
been, because template B was identified as the preferred
antecedent for template D instead of template C. If C
and D were the only two templates generated, then C
would have been identified as the preferred antecedent,
thus raising the probability.
4.2 A Model Based on Merging Decisions
The second approach we consider models the
likelihood of correctness of decisions that a template
merger such as the one used in FASTUS would make
in processing a text. To illustrate, consider the case
in our example in which the probability of the
coreference configuration ((A B D) (C)) is determined.
The merger would make the following decisions in
deriving such a configuration, in which the notation
"B&A" represents the template that results from
templates A and B having previously been merged.
1. B =c A? → yes
2. C =c B&A? → no
3. D =c C? → no
4. D =c B&A? → yes
We therefore model the probability of this coref- 
erence configuration as the product of each of the 
corresponding pairwise probabilities. Since we can- 
not model coreference involving objects that have re- 
sulted from previous (hypothetical) merges - the ap- 
propriate feature values for distance and form of re- 
ferring expression would become unclear - we make 
the following approximation: 
$$p(X =_c Y_1 \& \ldots \& Y_n) \approx p(X =_c Y_n)$$
in which $Y_n$ is the most recently created template in
$Y_1, \ldots, Y_n$.
Using the probabilities from Table 1, 8 the probability
assigned to ((A B D) (C)) would therefore be
$$p(B =_c A) \cdot p(C \neq_c B) \cdot p(D \neq_c C) \cdot p(D =_c B) = 0.671 \cdot 1 \cdot (1 - 0.504) \cdot 0.752 = 0.250$$
Note that unlike the evidential approach, the proba- 
bility of the pair D and A coreferring does not come 
into play, given that coreference between D and B 
and between B and A has been factored in. 
This approach yields a probabilistic model as 
given, that is, the probabilities sum to 1 without 
normalization. However, in certain circumstances 
the approximation above will generate probability 
mass for an impossible case, specifically when it is 
known a priori that X is incompatible with one of 
the templates $Y_1, \ldots, Y_{n-1}$. For instance, if templates
B and C in our example had been compatible (with 
A and C remaining incompatible), then the approxi- 
mation above would assign positive probability mass 
to the coreference configuration ((A B C) (D)), be- 
cause the zero probability of A coreferring with C 
would not come into play. Therefore we modify the 
above approximation to apply only if X and each of 
$Y_1, \ldots, Y_{n-1}$ are compatible; otherwise, the probability
mass assigned is used for normalization. One can
see that this can only improve the pure form of the 
model. 
Using the pairwise probabilities from Table 1, the 
results of the model as applied to the example are: 
1. (A B D) (C) = .250
2. (A B) (C D) = .338
3. (A B) (C) (D) = .083
4. (A D) (B) (C) = .020
5. (B D) (A) (C) = .123
6. (C D) (A) (B) = .166
7. (A) (B) (C) (D) = .020
8 We use these probabilities for ease of comparison.
In reality, the pairwise probabilities for this model were
trained with an adapted set of training data as
explained below, and so these numbers are in actuality a
bit different.
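The merging-decision computation can be sketched in Python as follows. The sketch assumes templates are created in the order A, B, C, D, that pairs absent from Table 1 are a priori incompatible (probability 0), and that existing cells are tried most recent first; all names are illustrative.

```python
PAIRWISE = {
    frozenset("AB"): 0.671, frozenset("AD"): 0.505,
    frozenset("BD"): 0.752, frozenset("CD"): 0.504,
}  # Table 1; pairs not listed are treated as incompatible

def p_pair(x, y):
    return PAIRWISE.get(frozenset((x, y)), 0.0)

def raw_probability(order, config):
    """Product of the merge decisions that would yield `config`."""
    cell_of = {t: i for i, cell in enumerate(config) for t in cell}
    prob, seen = 1.0, []
    for x in order:
        # Most recent member of each existing cell, most recent cell first.
        recent = []
        for y in reversed(seen):
            if cell_of[y] not in [cell_of[r] for r in recent]:
                recent.append(y)
        for y in recent:
            p = p_pair(x, y)   # p(X =c Y1&...&Yn) approximated by p(X =c Yn)
            if cell_of[x] == cell_of[y]:
                prob *= p      # decision: merge, then stop
                break
            prob *= 1.0 - p    # decision: do not merge
        seen.append(x)
    return prob

def distribution(order, configs):
    raw = [raw_probability(order, c) for c in configs]
    total = sum(raw)  # below 1 only if mass fell on impossible configurations
    return [r / total for r in raw]

CONFIGS = [
    [{"A", "B", "D"}, {"C"}], [{"A", "B"}, {"C", "D"}],
    [{"A", "B"}, {"C"}, {"D"}], [{"A", "D"}, {"B"}, {"C"}],
    [{"B", "D"}, {"A"}, {"C"}], [{"C", "D"}, {"A"}, {"B"}],
    [{"A"}, {"B"}, {"C"}, {"D"}],
]
for n, p in enumerate(distribution("ABCD", CONFIGS), start=1):
    print(n, round(p, 3))
```

With these inputs the raw products already sum to 1, so the normalization step leaves them unchanged, and the rounded values agree with the seven figures listed above.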
4.3 Two Bases of Comparison 
We compared the two learned models with two base- 
line models. First, as an absolute baseline, we com- 
pared the model with the uniform distribution, that 
is, the distribution that assigns equal probability to 
each alternative. We then sought a more challeng- 
ing, yet straightforward baseline. We defined a sim- 
ple, "greedy" approach to merging similar to the 
one used in FASTUS, in which merging of newly- 
created templates is attempted iteratively through 
the prior discourse, starting with the most recently 
produced object. Any unifications that succeed are 
performed. For instance, in the above example, the 
greedy method produces the configuration ((A B) 
(C D)), because A is compatible with B, C is not 
compatible with either, and D is compatible with C 
(with which merging would be attempted before the 
earlier-evoked templates B and A). Alternatively, in 
cases in which all of the templates in a coreference 
set are pairwise compatible, the greedy method will 
produce the configuration in which they are all coref- 
erential. 
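The greedy strategy just described can be sketched as follows. This is an illustration under stated assumptions rather than the FASTUS merger itself: the names greedy_merge and compatible are hypothetical, and we assume that two sets of templates unify only when every pair of templates drawn from them is compatible.

```python
def greedy_merge(templates, compatible):
    """Greedy merging baseline (a sketch, not the FASTUS implementation).

    `templates` lists the templates in order of creation, and
    `compatible(a, b)` reports whether two templates can unify.
    """
    cluster_of = {}   # template -> index of the cluster that holds it
    clusters = []     # index -> list of member templates
    for position, new in enumerate(templates):
        cluster_of[new] = len(clusters)
        clusters.append([new])
        # Try earlier templates, starting with the most recently produced.
        for old in reversed(templates[:position]):
            src, dst = cluster_of[new], cluster_of[old]
            if src == dst:
                continue  # already merged with this template's set
            # Assumption: a merge succeeds only if all cross pairs are compatible.
            if all(compatible(x, y) for x in clusters[src] for y in clusters[dst]):
                for member in clusters[src]:
                    cluster_of[member] = dst
                clusters[dst].extend(clusters[src])
                clusters[src] = []
    return [sorted(c) for c in clusters if c]


# Running example: C is incompatible with both A and B.
INCOMPATIBLE = {frozenset("AC"), frozenset("BC")}


def example_compatible(a, b):
    return frozenset((a, b)) not in INCOMPATIBLE


print(greedy_merge(["A", "B", "C", "D"], example_compatible))
```

On the running example the sketch merges D with C first, after which the combined set can no longer unify with A or B. When every pair is compatible, all templates end up in a single set.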
We then calculated how often this approach
yielded the correct result in each training set. We
distinguished among three values: the percentage
of correct results for coreference sets of cardinality 2
(call this p2), the percentage for coreference sets of
cardinality 3 (call this p3), and the percentage for
coreference sets of cardinality 4 or more (call this
p>3). The greedy model was defined such that the
result of the greedy merging strategy is assigned the
appropriate probability pk, with the remainder of
the probability mass, 1 - pk, distributed uniformly
among the remaining possible alternatives. (No
alternatives were included that were known a priori
to be impossible due to incompatibilities.)
For instance, in the first training set we describe
below, p2=.571, p3=.652, and p>3=.344 (the
percentage for the whole training corpus was p=.555).
If there are 4 templates, and 10 coreference
configurations are possible, then the answer derived
by the greedy strategy would receive probability
.344, and each of the remaining 9 alternatives would
receive probability (1 - .344)/9 = .0729. In the second
training set we describe below, p2=.646, p3=.600,
and p>3=.345 (the percentage for the whole training
corpus was p=.549), and in the third training set,
p2=.628, p3=.600, and p>3=.280 (the percentage for
the whole training corpus was p=.523).
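The way the greedy baseline turns a single answer into a distribution can be sketched as below. The function name, the string labels standing in for configurations, and the treatment of the case with only one possible configuration are assumptions made for illustration.

```python
def greedy_baseline(configurations, greedy_choice, p_k):
    """Give p_k to the greedy answer and split 1 - p_k evenly over the rest."""
    others = [c for c in configurations if c != greedy_choice]
    if not others:
        # Assumption: with a single possible configuration, it gets all the mass.
        return {greedy_choice: 1.0}
    share = (1.0 - p_k) / len(others)
    distribution = {c: share for c in others}
    distribution[greedy_choice] = p_k
    return distribution


# Ten hypothetical configuration labels; "c0" stands in for the greedy answer.
labels = [f"c{i}" for i in range(10)]
dist = greedy_baseline(labels, "c0", 0.344)
print(round(dist["c1"], 4))  # 0.0729
```

With p>3=.344 and ten alternatives, this reproduces the figures in the example above: .344 for the greedy answer and about .0729 for each of the other nine.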
5 Experiments 
5.1 Training the Maximum Entropy Models
For reasons described below, we trained separate 
pairwise probability models for each of the two ap- 
proaches. We ran FASTUS over our development 
corpus, 72 texts of which produced coreference data. 
The texts gave rise to 132 coreference sets, and pro- 
duced characteristics of context for 647 potential 
coreference relationships between pairs of templates. 
We created a key by analyzing the texts and entering 
the correct coreference relationships. 
We created three splits of training and test data. 
In the first split, the training set contained 60 mes- 
sages, giving rise to 110 coreference sets, and the test 
set contained 12 messages, giving rise to 22 corefer- 
ence sets. In the second split, the training set con- 
tained 57 messages, giving rise to 102 coreference 
sets, and the test set contained 15 messages, giving 
rise to 30 coreference sets. In the third split, the test
set was created by combining the first and second test
sets: the training set contained 47 messages, giving rise
to 88 coreference sets, and the test set contained 25
messages (the first two test sets overlapped by two
messages), giving rise to 44 coreference sets.
For training the maximum entropy model, only 
the sets of characteristics of context for pairwise 
coreference are relevant; the number of such sets dif- 
fered between the two approaches as discussed be- 
low. The evaluations were performed on the test sets 
with respect to the final distribution generated for 
the coreference sets, with the result being measured 
in terms of the average cross-entropy between the 
model and the test data. 
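As a rough illustration of this evaluation measure, the sketch below computes an average cross-entropy in bits. It assumes that the key places all of its probability on the single correct configuration of each coreference set, so that the cross-entropy for one set reduces to the negative base-2 logarithm of the probability the model assigns to that configuration. The function name and data layout are hypothetical.

```python
import math


def average_cross_entropy(predictions):
    """Mean of -log2 p(correct configuration) over coreference sets.

    `predictions` is a non-empty list of (distribution, correct) pairs,
    where `distribution` maps each configuration to its model probability.
    """
    total = 0.0
    for distribution, correct in predictions:
        p = distribution.get(correct, 0.0)
        if p <= 0.0:
            return math.inf  # the model ruled out the correct answer
        total -= math.log2(p)
    return total / len(predictions)


# A uniform model over two alternatives costs exactly one bit per set.
uniform_two = {"same": 0.5, "different": 0.5}
print(average_cross_entropy([(uniform_two, "same")]))  # 1.0
```

Lower values are better. A model that assigns zero probability to a correct configuration is penalized without bound, which is one reason the distributions must keep some mass on every possible alternative.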
Data for the Evidential Model. The evidential
model utilizes the pairwise probabilities between all 
pairs of templates in a coreference set. Therefore, we 
used all such pairs in each training set to train the 
maximum entropy model. In the first training set, 
the 110 coreference sets gave rise to characteristics 
of context for 578 pairs of templates; in the second, 
the 102 coreference sets gave rise to characteristics 
for 581 pairs of templates. In the third training set, 
the 88 coreference sets gave rise to characteristics for 
525 pairs of templates. 
The maximum entropy algorithm selected similar
sets of features to model in each case.9 Among the
systems of λi values learned, negative values were
learned for the features in which template S properly
subsumes template T and in which S and T are
otherwise consistent. These two features model the
cases in which template T contains information not
contained in template S, reflecting the fact that
expressions referring to the same entity usually do not
become more specific as the discourse proceeds. A
positive value was learned for the feature modeling
cases in which templates S and T had at least two
identical non-nil slot values, as well as for the feature
modeling an exact match of complex name values.
As one might expect, a negative value was learned
for the case in which template T was created from an
indefinite expression. A positive value was learned
for the case in which template T was created from a
definite expression and S was (perhaps transitively)
the preferred referent according to the coreference
module. Interestingly, no value was learned for
template S being a possible but non-preferred referent,
but a small positive value was learned for it not
being on the list at all; presumably this covers cases
in which the coreference module fails to identify an
existing referent. All the distance features except for
close and mid-distance received negative λi values,
suggesting that coreference between close and
mid-distance templates was more likely than coreference
between templates that were very close, far away,
or very far away.
9 The following features represent the referenced
characteristic of context paired with the result of coreference.
The cross-entropy of the learned model as applied
to the training data in each case was about 0.80.
Given that the cross-entropy of the uniform
distribution and the data is 1 (as there are only two
possible values for the random variable, i.e., S and T
are coreferent or not), this relatively small
reduction suggests that the problem has some amount of
difficulty, which is consistent with the notable lack
of clear signals of coreference characteristic of the
texts in our domain.
Data for the Merging Decision Model. Unlike
the evidential model, the merging decision model
does not always utilize all of the pairwise probabilities
between pairs in a coreference set. For instance,
in determining the probability of a coreference
configuration ((A B C)), it does not consider the
probability assigned to the pair A and C except to check
that they are compatible. Therefore, the training
set for the maximum entropy algorithm was pared
down to only contain those pairs that the merger
would have considered in deriving the correct
coreference configurations. The resulting data had the
same coreference sets as the training data for the
evidential approach, but consisted of characteristics
of context for 415 template pairs in the first train- 
ing set, 405 pairs in the second training set, and 370 
pairs in the third training set. The features selected 
were similar to those in the training of the evidential 
model. 
The cross-entropies of the learned maximum en- 
tropy models and the training data were notably 
better than those for the evidential model, at about 
0.70 in each case. This improvement is not partic- 
ularly surprising. In the evidential case, the fact 
that all pairs of templates are considered results in 
a certain amount of "washing out" of the data, due 
to redundancy in coreference relationships. For in- 
stance, coreference between two templates that are 
far away might be unlikely if there are no corefer- 
ring expressions between them, but quite likely if 
there are. When just considering the pairwise fea- 
ture sets, these two cases are not distinguished, so 
the resulting probability will be mixed. However, in 
the merging decision case, pairs that are far away 
will not be in the data set if there are coreferring 
expressions between them, and thus the probability 
for coreference at long distances will be diminished. 
The result is a "cleaner" set of data in which clearer 
distinctions may be found, as evidenced by the lower 
cross-entropy achieved. 
5.2 Evaluation Results 
The cross-entropies of the various approaches as
applied to the three sets of test data are shown in
Table 2. The number within parentheses indicates the
number of times that the coreference set with the
highest probability was the correct one. As hoped,
both the evidential and merging decision approaches
outperformed the uniform and greedy approaches
with respect to cross-entropy.10
                   Test Set 1   Test Set 2   Test Sets 1 and 2
Uniform            2.12 (--)    1.76 (--)    2.01 (--)
Greedy             1.50 (15)    1.30 (20)    1.41 (30)
Merging Decision   1.32 (15)    1.13 (20)    1.27 (27)
Evidential         1.10 (17)    0.89 (21)    1.00 (35)
Table 2: Initial Evaluation Cross-Entropies
10 The merging decision approach did not do any better
than the greedy approach in terms of raw accuracy, and
in fact did somewhat worse in the third test. Again,
however, the reduction in cross-entropy is important, as the
statistics produced by the system will be integrated with
other probabilistic factors in the downstream system.
Interestingly, and perhaps surprisingly, the
evidential approach outperformed the merging decision
model, even though in many respects the latter is
more natural and elegant. While considering
feature sets for all pairs may wash out the training
data for the pairwise probability model somewhat,
the evidence provided by all pairs appears to more
than make up for the difference. Given that a goal
of these experiments is to see how well the
strategies would perform with a fairly crude, easily
computable, and portable set of characteristics of
context, we are encouraged by the results of these
experiments, especially considering the limited amount
of training data that was available.
Nonetheless, additional data is necessary to con- 
firm the results of these initial evaluations. Although 
the consistency of the results between the first two 
training/test divisions may suggest that the amount 
of training data is sufficient for the rather coarsely 
grained feature set used, the size of the test sets
is potentially of concern, which motivated our
inclusion of the third training/test division. Despite
the reduction in training data and corresponding
increase in test data, the results of this experiment
appear to be consistent with the first two.
There are a variety of characteristics of context 
that one might add to improve the models. For 
instance, one could add a characteristic indicating 
when a template is created from a phrase in a sub- 
ject line or table, as many cases of coreference with 
subsequent indefinite phrases occur in this circum- 
stance. Other types of information about text type, 
text structure, and more finely grained distinctions 
with respect to referential types (e.g., modeling pro- 
nouns differently than other definite NPs) would all 
likely further improve the model, although for some 
of these additional training data would be required 
and more domain and genre dependence may result. 
While this work was motivated by a need to pass 
probabilistic output to a downstream data fusion 
system, these methods can also be applied
system-internally, to supplant existing algorithms for
merging in IE settings that do not allow for probabilistic
output. In this scenario, the system simply performs 
the template merging dictated by the most proba- 
ble coreference configuration for a given coreference 
set. However, as noted earlier, the texts in our appli- 
cation are relatively short, and therefore the coref- 
erence sets are usually of manageable size. Signif- 
icantly larger coreference sets can lead to an enor- 
mous number of possible coreference configurations. 
Therefore, to address this task in applications with 
much longer texts, mechanisms beyond those that 
were necessary here will be required for intelligently 
pruning the search space and subsequently smooth- 
ing the distributions. 
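To give a sense of how quickly the search space grows, the sketch below counts the coreference configurations of n templates when no pair is ruled out a priori. Each configuration is a partition of the coreference set, so the count is the nth Bell number. The function name is an assumption, and the figures are an upper bound: incompatibilities between templates reduce them.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)] computed with the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
        bells.append(row[0])
    return bells


counts = bell_numbers(10)
for n in (2, 3, 4, 10):
    print(n, counts[n])
```

Four templates admit at most 15 configurations, whereas ten templates already admit 115975, which is why exhaustive enumeration is only practical for the short texts considered here.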
6 Conclusions 
Certain applications require that the output of an in- 
formation extraction system be probabilistic, so that 
a downstream system can reliably fuse the output
with possibly contradictory information from other 
sources. In this paper we considered the problem 
of assigning a probability distribution to alterna- 
tive sets of coreference relationships among entity 
descriptions. We presented the encouraging results 
of initial experiments with several approaches to es- 
timating such distributions in an application using 
SRI's FASTUS information extraction system. We 
would expect further gains from encoding additional 
training data and modeling more informative char- 
acteristics of context. 
Acknowledgments 
The author thanks John Bear, Joshua Goodman, 
and two anonymous reviewers for helpful comments 
and criticisms, and the SRI Message Handler project 
team for their contributions to the system in which 
this work is embedded. This work was supported by 
the Defense Advanced Research Projects Agency un- 
der contract number 4099SCL001 (E-Systems Inc., 
prime contractor). 

References 
Aone, Chinatsu and Scott William Bennett. 1995. 
Evaluating automated and manual acquisition of 
anaphora resolution strategies. In Proceedings of 
the 33rd Annual Meeting of the Association for 
Computational Linguistics (ACL-95), pages 122- 
129, Cambridge, MA, June. 
Berger, Adam, Stephen A. Della Pietra, and Vin- 
cent J. Della Pietra. 1996. A maximum entropy 
approach to natural language processing. Compu- 
tational Linguistics, 22(1):39-71. 
Connolly, Dennis, John D. Burger, and David S. 
Day. 1994. A machine learning approach to 
anaphoric reference. In Proceedings of the Inter- 
national Conference on New Methods in Language 
Processing (NeMLaP). 
Dagan, Ido and Alon Itai. 1990. Automatic acquisi- 
tion of constraints for the resolution of anaphora 
references and syntactic ambiguities. In Proceed- 
ings of the 13th International Conference on Com- 
putational Linguistics (COLING-90), pages 330- 
332. 
Dagan, Ido, John Justenson, Shalom Lappin, Her- 
bert Leass, and Amnon Ribak. 1995. Syntax and 
lexical statistics in anaphora resolution. Applied 
Artificial Intelligence, 9(6):633-644, Nov/Dec. 
Dempster, Arthur P. 1968. A generalization of 
Bayesian inference. Journal of the Royal Statis- 
tical Society, 30:205-247. 
Hobbs, Jerry R., Douglas E. Appelt, John Bear, 
David Israel, Megumi Kameyama, Mark Stickel, 
and Mabry Tyson. 1996. FASTUS: A cascaded 
finite-state transducer for extracting information 
from natural-language text. In Finite State De- 
vices for Natural Language Processing. MIT Press, 
Cambridge, MA. 
Kennedy, Christopher and Branimir Boguraev. 
1996a. Anaphora for everyone: Pronominal 
anaphora resolution without a parser. In Pro- 
ceedings of the 16th International Conference on 
Computational Linguistics (COLING-96). 
Kennedy, Christopher and Branimir Boguraev. 
1996b. Anaphora in a wider context: Track- 
ing discourse referents. In Proceedings of the 
12th European Conference on Artificial Intelli- 
gence (ECAI-96). 
Lappin, Shalom and Herbert Leass. 1994. An algo- 
rithm for pronominal anaphora resolution. Com- 
putational Linguistics, 20(4):535-561. 
