Using Coreference for Question Answering 
Thomas S. Morton 
Department of Computer and Information Science 
University of Pennsylvania
tsmorton@cis.upenn.edu
Abstract 
We present a system which retrieves answers to 
queries based on coreference relationships be- 
tween entities and events in the query and doc- 
uments. An evaluation of this system is given 
which demonstrates that the amount of
information that the user must process on
average to find an answer to their query is reduced
by an order of magnitude. 
1 Introduction 
Search engines have become ubiquitous as a 
means for accessing information. When a rank- 
ing of documents is returned by a search en- 
gine the information retrieval task is usually not 
complete. The document, as a unit of information,
is often too large for many users' information
needs, and finding information within the
set of returned documents poses a burden of its 
own. Here we examine a technique for extract- 
ing sentences from documents which attempts 
to satisfy the user's information needs by providing
an answer to the query presented. The system
does this by modeling coreference relationships
between entities and events in the query
and documents. An evaluation of this system is
given which demonstrates that it performs better
than a standard tf.idf weighting and
that the amount of information that the user 
must process on average, to find an answer to 
their query, is reduced by an order of magnitude 
over document ranking alone. 
2 Problem Statement 
A query indicates an informational need by the 
user to the search engine. The information re- 
quired may take the form of a sentence or even 
a noun phrase. Here the task is to retrieve the 
passage of text which contains the answer to 
the query from a small collection of documents. 
Sentences are then ranked and presented to the
user. We only examine queries whose answers
are likely to be stated in a sentence or
noun phrase, since answers which are typically
longer can be difficult to annotate reliably.
This technology differs from the standard document
ranking task in that, if successful, the user
will likely not need to examine any of the retrieved
documents in their entirety. It also
differs from the document summarization provided
by many search engines today in that the
sentences selected are influenced by the query 
and are selected across multiple documents. 
We view a system such as ours as providing 
a secondary level of processing after a small set 
of documents, which the user believes contain 
the information desired, has been found. This
first step would likely be provided by a traditional
search engine; thus this technology serves
as an enhancement to an existing document retrieval
system rather than a replacement. Advancements
in document retrieval would only
help the performance of a system such as ours 
as these improvements would increase the like- 
lihood that the answer to the user's query is in 
one of the top ranked documents returned. 
3 Approach 
A query is viewed as identifying a relation to 
which a user desires a solution. This relation 
will most likely involve events and entities, and 
an answer to this relation will involve the same 
events and entities. Our approach attempts to 
find coreference relationships between the enti- 
ties and events evoked by the query and those 
evoked in the document. Based on these rela- 
tionships, sentences are ranked, and the highest 
ranked sentences are displayed to the user. 
The coreference relationships that are mod- 
eled by this system include identity, part-whole, 
and synonymy relations. Consider the following 
query and answer pairs. 
Query: What did Mark McGwire say 
about child abuse? 
Sentence: "What kills me is that you 
know there are kids over there who 
are being abused or neglected, you 
just don't know which ones" McGwire 
says. 
In the above query-answer pair, the system attempts
to capture the identity relationship between
Mark McGwire and McGwire by determining
that the term McGwire in this sentence
is coreferent with a mention of Mark McGwire
earlier in the document. This allows the system
to rank this sentence equivalently to a sentence
mentioning the full name. The system
also treats the term child abuse as a nominaliza- 
tion which allows it to speculate that the term 
abused in the sentence is a related event. Finally 
the verb neglect occurs frequently within doc- 
uments which contain the verb abuse, which is 
nominalized in the query, so this term is treated 
as a related event. The system does not cur- 
rently have a mechanism which tries to capture 
the relationship between kids and children. 
Query: Why did the U.S. bomb Su- 
dan? 
Sentence: Last month, the United 
States launched a cruise missile at- 
tack against the Shifa Pharmaceuti- 
cal Industries plant in Khartoum, al- 
leging that U.S. intelligence agencies 
have turned up evidence - including 
soil samples - showing that the plant 
was producing chemicals which could 
be used to make VX, a deadly nerve 
gas. 
In this example one of the entity-based relation- 
ships of interest is the identity relationship be- 
tween U.S. and United States. Also of interest is 
the part-whole relationship between Sudan and 
Khartoum, its capital. Finally, the bomb event
is related to the launch/attack event. The sys- 
tem does not currently have a mechanism which 
tries to capture the relationship between Why 
and alleging or evidence. 
4 Implementation 
The relationships above are captured by a num- 
ber of different techniques which can be placed 
in essentially two categories. The first group 
finds identity relationships between different in- 
vocations of the same entity in a document. The 
second identifies more loosely defined relationships
such as part-whole and synonymy. Each of
the relationships identified is given a weight, and
based on the weights and the relationships
themselves, sentences are ranked and presented to the
user.
4.1 Identity Relationships 
Identity relationships are first determined be- 
tween the string instantiations of entities in sin- 
gle documents. This is done so that the dis- 
course context in which these strings appear 
can be taken into account. The motivation for 
this comes in part from example texts where the 
same last name will be used to refer to differ- 
ent individuals in the same family. This is of- 
ten unambiguous because full names are used in 
previous sentences; however, this requires some
modeling of which entities are most salient in 
the discourse. These relations are determined 
using techniques described in (Baldwin et al., 
1998). 
Another source of identity relationships 
is morphological and word order variations. 
Within noun phrases in the query the sys- 
tem constructs other possible word combina- 
tions which contain the head word of the noun 
phrase. For example, a noun phrase such as "the
photographed little trouper" would be extended
to include "the photographed trouper", "the little
trouper", and "the trouper", as well as variations
excluding the determiner. Each of the
variations is given a weight based on the ratio of
the score that the new, shorter term would have
received had it appeared in the query to the score
of the actual noun phrase that occurred. The morphological
roots of single-word variations are also
added to the list of possible terms which refer
to the entity or event, with no additional deduction
in weighting. Finally, query entities which
are found in an acronym database are added to
the list of coreferring terms as well, with a weight
of 1.
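The noun phrase expansion described above can be sketched as follows. This is only an illustration under stated assumptions, not the system's implementation: it assumes the head word is the last token of the phrase, that determiners come from a small hand-written set, and that a phrase's score is the sum of the idf values of its non-determiner words. The names `np_variants` and `variant_weight` and all idf values are hypothetical.

```python
from itertools import combinations

DETERMINERS = {"the", "a", "an"}  # assumed, illustrative determiner list


def np_variants(tokens):
    """Return every sub-phrase that keeps the head word.

    Assumes the head is the final token; modifier order is preserved.
    The full phrase itself is included in the result.
    """
    *modifiers, head = tokens
    variants = set()
    for k in range(len(modifiers) + 1):
        for kept in combinations(modifiers, k):
            variants.add(kept + (head,))
    return variants


def variant_weight(variant, full, idf):
    """Ratio of the shorter phrase's score to the full phrase's score.

    Assumes a phrase score is the sum of its non-determiner idf values.
    """
    def score(phrase):
        return sum(idf.get(t, 0.0) for t in phrase if t not in DETERMINERS)

    return score(variant) / score(full)


phrase = ("the", "photographed", "little", "trouper")
idf = {"photographed": 4.0, "little": 2.0, "trouper": 6.0}  # made-up values
for v in sorted(np_variants(phrase), key=lambda p: (len(p), p)):
    print(" ".join(v), round(variant_weight(v, phrase, idf), 2))
```

Under these assumptions, dropping a rare modifier lowers a variant's weight more than dropping a common one, and variants that differ only in the determiner receive the same weight.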
4.2 Part-Whole and Synonymy 
Relationships 
The system captures part-whole and synonymy
relationships by examining co-occurrence statistics
between certain classes of words. Specifically,
co-occurrence statistics are gathered on
verbs and nominalizations which co-occur much
more often than one would expect based on
chance alone. This is also done for proper
nouns. For each verbal pair or proper noun pair,
the mutual information between the two is computed
as follows:
$$ I(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)} $$
where $w_1$ and $w_2$ are words and an event is defined
as a word occurring in a document. All
words $w_2$ for which $I(w_1, w_2)$ exceeds a threshold,
where $w_1$ is a query term, are added to the
list of terms with which the query term can be
referred to. This relationship is given a
weight of $I(w_1, w_2)/N$, where $N$ is a normalization
constant. The counts for the mutual information
statistics were gathered from a corpus of
over 62,000 Wall Street Journal articles which
have been automatically tagged and parsed.
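The mutual information expansion can be sketched on a toy corpus as follows. This is a minimal sketch, not the system's code: the documents, the threshold, and the normalization constant are made up, and the restriction to verbs, nominalizations, and proper nouns is omitted. The function names `mutual_information` and `expansions` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(docs):
    """Compute I(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))).

    An event is a word occurring in a document, so each probability is a
    document frequency divided by the number of documents.
    """
    n = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))  # count each word once per document
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    mi = {}
    for (w1, w2), joint in pair_df.items():
        p_joint = joint / n
        mi[(w1, w2)] = math.log(p_joint / ((word_df[w1] / n) * (word_df[w2] / n)))
    return mi


def expansions(query_term, mi, threshold, norm):
    """Terms whose mutual information with query_term exceeds the
    threshold, each weighted by I(w1, w2) / norm."""
    result = {}
    for (w1, w2), value in mi.items():
        if value > threshold and query_term in (w1, w2):
            other = w2 if w1 == query_term else w1
            result[other] = value / norm
    return result


# Toy corpus: each document is a list of (already lemmatized) words.
docs = [
    ["abuse", "neglect"],
    ["abuse", "neglect", "report"],
    ["report", "trade"],
    ["trade"],
]
mi = mutual_information(docs)
print(expansions("abuse", mi, threshold=0.5, norm=2.0))
```

In this toy corpus "abuse" and "neglect" always appear together, so only "neglect" passes the threshold; "report" co-occurs with "abuse" no more often than chance and is not added.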
4.3 Sentence Ranking 
Before sentence ranking begins, each entity or
event in the query is assigned a weight. This
weight is the sum of the inverse document frequency
(idf) measures of the entity's or event's terms, based on
their occurrence in the Wall Street Journal corpus
described in the previous section. This measure 
is computed as: 
$$ \mathrm{idf}(w_1) = \log\left(\frac{N}{\mathrm{df}(w_1)}\right) $$
where $N$ is the total number of documents in the
corpus and $\mathrm{df}(w_1)$ is the number of documents
which contain word $w_1$. Once weighted, the system
compares the entities and events evoked by
the query with the entities and events evoked by
the document. The comparison is done via simple
string matching against all the terms with
which the system has determined an entity or
event can be referred to. Since these term expansions
are weighted, the score for a particular
term $w_2$ and a query term $w_1$ is:
$$ S(w_1, w_2) = \mathrm{idf}(w_1) \times \mathrm{weight}_{w_1}(w_2) $$
where $\mathrm{weight}_{w_1}$ is the weight assigned during
one of the previous term expansion phases and
idf is defined above. The $\mathrm{weight}_{w_1}$ function is
defined to be 0 for any term $w_2$ for which no
expansion took place. The score for a particular
entity or event in the document with respect
to an entity or event in the query is the
maximum value of $S(w_1, w_2)$ over all values of
$w_1$ and $w_2$ for that entity or event. A particular
sentence's score is computed as the sum of the
scores of the set of entities and events it evokes.
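The scoring scheme above can be sketched as follows. This is an illustration under assumptions, not the system's implementation: a query entity is represented as a mapping from each query term to its weighted expansions, a document entity as the set of strings that refer to it, and scores are summed over every pairing of a sentence entity with a query entity (the text does not say how multiple query entities are combined). The document frequencies and expansion weights are made up, and all names are hypothetical.

```python
import math


def idf(term, doc_freq, n_docs):
    """idf(w) = log(N / df(w)). Unseen terms get df = 1, an assumption."""
    return math.log(n_docs / doc_freq.get(term, 1))


def pair_score(doc_terms, query_entity, doc_freq, n_docs):
    """Maximum of S(w1, w2) = idf(w1) * weight_w1(w2) over all w1 and w2."""
    best = 0.0
    for w1, weights in query_entity.items():
        for w2 in doc_terms:
            # weight_w1(w2) is 0 when w2 is not an expansion of w1
            best = max(best, idf(w1, doc_freq, n_docs) * weights.get(w2, 0.0))
    return best


def sentence_score(sentence_entities, query_entities, doc_freq, n_docs):
    """Sum pair scores over the entities and events a sentence evokes."""
    return sum(
        pair_score(doc_terms, query_entity, doc_freq, n_docs)
        for doc_terms in sentence_entities
        for query_entity in query_entities
    )


N_DOCS = 62000  # approximate corpus size given in the text
DOC_FREQ = {"mcgwire": 50, "abuse": 200}  # made-up document frequencies

# Each query entity maps a query term to its expansions and their weights.
QUERY = [
    {"mcgwire": {"mcgwire": 1.0}},
    {"abuse": {"abuse": 1.0, "abused": 1.0, "neglect": 0.4}},
]
# Each sentence entity is the set of strings that can refer to it.
SENTENCE = [{"mcgwire", "mark mcgwire"}, {"abused"}, {"neglected", "neglect"}]

print(sentence_score(SENTENCE, QUERY, DOC_FREQ, N_DOCS))
```

A sentence that shares no entity or event with the query scores 0, while the related event "neglect" contributes a fraction of the idf of "abuse".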
For the purpose of evaluation a baseline sys- 
tem was also constructed. This system fol- 
lowed a more standard information retrieval ap- 
proach to text ranking described in (Salton, 
1989). Each token in the query is assigned
an idf score, also based on the same corpus of
Wall Street Journal articles as used with the
other system. Query expansion simply consisted
of stemming the tokens using a version of
the Porter stemmer, and sentences were scored
as a sum of all matching terms, giving the familiar
tf.idf measure.
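The baseline scoring can be sketched in a few lines. This is a rough sketch under assumptions: `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, tokenization is a whitespace split, stems with no known document frequency are skipped, and the document frequencies are made up. The function names are hypothetical.

```python
import math


def toy_stem(token):
    """Crude suffix stripper; a stand-in for the Porter stemmer."""
    token = token.lower().strip('.,?!"')
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def baseline_score(query, sentence, doc_freq, n_docs):
    """Add idf once per matching token occurrence, i.e. tf times idf."""
    query_stems = {toy_stem(t) for t in query.split()}
    score = 0.0
    for token in sentence.split():
        stem = toy_stem(token)
        # Stems missing from doc_freq are skipped (an assumption).
        if stem in query_stems and stem in doc_freq:
            score += math.log(n_docs / doc_freq[stem])
    return score


DOC_FREQ = {"israel": 900, "tank": 300}  # made-up document frequencies
query = "What kind of tanks does Israel have?"
sentence = "Israel fields Merkava tanks and older tanks."
print(baseline_score(query, sentence, DOC_FREQ, 62000))
```

Because idf is added once per matching occurrence, a term that appears twice in the sentence contributes twice, which is where the term frequency factor comes from.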
5 Evaluation 
For the evaluation of the system, ten queries
were selected from a collection of actual queries
presented to an online search engine. Queries
were selected based on their expressing the user's
information need clearly, their being likely to be
answered in a single sentence, and their non-dubious
intent. The queries used in this evaluation are as
follows:
• Why has the dollar weakened against the 
yen? 
• What was the first manned Apollo mission 
to circle the moon? 
• What virus was spread in the U.S. in 1968? 
• Where were the 1968 Summer Olympics 
held? 
• Who wrote "The Once and Future King"? 
• What did Mark McGwire say about child 
abuse? 
• What are the symptoms of Chronic Fatigue 
Syndrome? 
• What kind of tanks does Israel have? 
• What is the life span of a white tailed deer? 
• Who was the first president of Turkey? 
The information requested by each query was
then searched for in a data source which was
considered likely to contain the answer. Sources
for these experiments include Britannica Online,
CNN, and the Web at large. Once a
promising set of documents was retrieved, the
top ten were annotated for instances of the answer
to the query. The system was then asked to
process the ten documents and present a ranked
listing of sentences.
System performance is presented below as the
rank of the highest-ranked sentence which contained an answer
to the question. A question mark is used to 
indicate that an answer did not appear in the 
top ten ranked sentences. 
Query    First answer's rank
         Full System    Baseline
1        2              4
2        2              3
3        8              6
4        2              4
5        7              8
6        1              3
7        4              ?
8        ?
9        1              1
10       1              1
6 Discussion 
Sentence extraction and ranking, while similar
to document ranking in its information retrieval
goals, appears to have very different properties.
While a document can often stand alone in its
interpretation, the interpretation of a sentence is
very dependent on the context in which it appears.
The modeling of the discourse gives the
entity-based system an advantage over token-based
models in situations where referring expressions
which provide little information outside
of their discourse context can be related to
the query. The most extreme case of
this is the use of pronouns.
The query expansion techniques presented
here are simplistic compared to many used in
information retrieval; however, they are trying
to capture different phenomena. Here the
goal is to capture different lexicalizations of the 
same entities and events. Since short news ar- 
ticles are likely to focus on a small number of 
entities and perhaps a single event or a group of 
related events it is hoped that the co-occurrence 
statistics gathered will reveal good candidates 
for alternate ways in which the query entities 
and events can be lexicalized. 
This work employs many of the techniques
used by (Baldwin and Morton, 1998) for performing
query-based summarization. Here, however,
the retrieved information attempts to meet
the user's information needs rather than helping
the user determine whether the entire document
being summarized possibly meets that
need. This system also differs in that it can
present the user with information from multiple
documents. While query-sensitive multi-document
systems exist (Mani and Bloedorn,
1998), evaluating such systems for the purpose
of comparison is difficult.
Our evaluation shows that the system per- 
forms better than the baseline although the 
baseline performs surprisingly well. We believe 
that this is, in part, due to the lack of any 
notion of recall in the evaluation. While all 
queries were answered by multiple sentences, 
for some queries, such as 4, 5, and 10, it is not
clear what benefit the retrieval of additional 
sentences would have. The baseline benefited 
from the fact that at least one of the answers 
typically contained most of the query terms. 
Classifying queries as single answer or multi- 
ple answer, and evaluating them separately may 
provide a sharper distinction in performance. 
Comparing the user's task with and without
the system reveals a stark contrast in the
amount of information that must be processed.
On average, the system required 290 bytes of
text to display the answer to the query to the
user. In contrast, had the user reviewed the
documents in the order presented by the search
engine, the answer would, on average, appear
after more than 3000 bytes of text had been
displayed.
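As a rough check of the order-of-magnitude claim, the two averages reported above can be compared directly, taking 3000 bytes as a lower bound on the document-ranking figure:

$$ \frac{3000\ \text{bytes}}{290\ \text{bytes}} \approx 10.3 $$

Since the document-ranking average is stated as more than 3000 bytes, the actual reduction factor is at least this large.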
7 Future Work 
This preliminary investigation into the task
revealed many areas for future work.
7.1 Term Modeling 
The treatment of entities and events needs to 
be extended to model the nouns which indicate 
events more robustly and to exclude relational 
verbs from consideration as events. A proba- 
bilistic model of pronouns where referents are 
treated as the basis for term expansion should 
also be considered. Another area which requires 
attention is wh-words. Even a simple model 
would likely reduce the space of entities con- 
sidered relevant in a sentence. 
7.2 Tools 
In order to be more effective, the models used for
basic linguistic annotation, specifically the part-of-speech
tagger, would need to be trained on a wider
class of questions than is available in the Penn
Treebank. The incorporation of a Named Entity
Recognizer would provide additional categories 
on which co-occurrence statistics could be based 
and would likely prove helpful in the modeling 
of wh-words. 
7.3 User Interaction 
Finally, since many of the system's components
are derived from unsupervised corpus analysis,
the system's language models could be updated
as the user searches. This may better characterize
the distribution of words in the areas in which the
user is interested, which could improve performance
for that user.
8 Conclusion 
We have presented a system which ranks sentences
such that the answer to a user's query
will be presented on average in under 300 bytes.
This system does this by finding entities and 
events shared by the query and the documents 
and by modeling coreference relationships be- 
tween them. While this is a preliminary inves- 
tigation and many areas of interest have yet to 
be explored, the reduction in the amount of text 
the user must process, to obtain the answers 
they want, is already dramatic. 

References 
Breck Baldwin and Thomas Morton. 1998. Dy- 
namic coreference-based summarization. In 
Proceedings of the Third Conference on Em- 
pirical Methods in Natural Language Process- 
ing, Granada, Spain, June. 
B. Baldwin, T. Morton, A. Bagga,
J. Baldridge, R. Chandraseker, A. Dim- 
itriadis, K. Snyder, and M. Wolska. 1998. 
Description of the UPENN CAMP system 
as used for coreference. In Proceedings of the 
Seventh Message Understanding Conference 
(MUC-7), Baltimore, Maryland. 
Inderjeet Mani and Eric Bloedorn. 1998. Ma- 
chine learning of generic and user-focused 
summarization. In Proceedings of the Fifteenth
National Conference on Artificial Intelligence
(AAAI-98). 
Gerald Salton. 1989. Automatic text process- 
ing: the transformation, analysis, and re- 
trieval of information by computer. Addison- 
Wesley Publishing Company, Inc. 
