Using Coreference Chains for Text Summarization 
Saliha Azzam and Kevin Humphreys and Robert Gaizauskas 
{s.azzam, k.humphreys, r.gaizauskas}@dcs.shef.ac.uk
Department of Computer Science, University of Sheffield 
Regent Court, Portobello Road 
Sheffield S1 4DP UK 
Abstract 
We describe the use of coreference chains for the 
production of text summaries, using a variety of 
criteria to select a 'best' chain to represent the 
main topic of a text. The approach has been im- 
plemented within an existing MUC coreference 
system, which constructs a full discourse model 
of texts, including information about changes 
of focus, which can be used in the selection 
of chains. Some preliminary experiments on 
the automatic evaluation of summaries are also 
described, using existing tools to attempt to 
replicate some of the recent SUMMAC manual 
evaluations. 
1 Introduction 
In this paper we report preliminary work which 
explores the use of coreference chains to con- 
struct text summaries. Sparck Jones (1993) has 
described summarization as a two stage process 
of (1) building a representation of the source 
text and (2) generating a summary represent- 
ation from the source representation and pro- 
ducing an output text from this summary rep- 
resentation. Our source representation is a set 
of coreference chains - specifically those chains 
of referring expressions produced by an inform- 
ation extraction system designed to participate 
in the MUC-7 coreference task (DARPA, 1998). 
Our summary representation is a 'best chain', 
selected from the set of coreference chains by 
the application of one or more heuristics. The 
output summary is simply the concatenation of 
(some subset of) sentences from the source text 
which contain one or more expressions occurring 
in the selected coreference chain. 
The intuition underlying this approach is that 
texts are in large measure 'about' some central 
entity, which is effectively the topic, or focus 
of the discourse. This intuition may be false - 
there may be more than one entity of central 
concern, or events or relations between entities 
may be the principal topic of the text. However, 
it is at the very least an interesting experiment 
to see to what extent a principal coreference 
chain can be used to generate a summary. Fur- 
ther, this approach, which we have implemented 
and preliminarily evaluated, could easily be ex- 
tended to allow summaries to be generated from 
(parts of) the best n coreference chains, or from 
event, as well as object, coreference chains.
The use of document extracts formed from 
coreference chains is not novel. Bagga and 
Baldwin (1998) describe a technique for cross- 
document coreference which involves extracting 
the set of all sentences containing expressions 
in a coreference chain for a specific entity (e.g. 
John Smith) from each of several documents. 
They then employ a thresholded vector space 
similarity measure between these document ex- 
tracts to decide whether the documents are dis- 
cussing the same entity (i.e. the same John 
Smith). Baldwin and Morton (1998) describe 
a query-sensitive (i.e. user-focused) summariz- 
ation technique that involves extracting sen- 
tences from a document which contain phrases 
that corefer with expressions in the query. The 
resulting extract is used to support relevancy 
judgments with respect to the query. 
The use of chains of related expressions in 
documents to select sentences for inclusion in a 
generic (i.e. non-user-focused) summary is also 
not novel. Barzilay and Elhadad (1997) de- 
scribe a technique for text summarization based 
on lexical chains. Their technique, which builds 
on work of Morris and Hirst (1994), and ulti- 
mately Halliday and Hasan (1976) who stressed 
the role of lexical cohesion in text coherence, 
is to form chains of lexical items across a text 
based on the items' semantic relatedness as in- 
dicated by a thesaurus (WordNet in their case). 
These lexical chains serve as their source rep- 
resentation, from which a summary represent- 
ation is produced using heuristics for choosing 
the 'best' lexical chains. From these the sum- 
mary is produced by employing a further heur- 
istic to select the 'best' sentences from each of 
the selected lexical chains. 
The novelty in our work is to combine the 
idea of a document extract based on coreference 
chains with the idea of chains of related expres- 
sions serving to indicate sentences for inclusion 
in a generic summary (though we explore the 
use of coreference between query and text as a 
technique for generating user-focused summar- 
ies as well). 
Returning to Halliday and Hasan, one can 
see how this idea has merit within their frame- 
work. They identify four principal mechanisms 
by which text coherence is achieved - reference, 
substitution and ellipsis, conjunction and lexical 
cohesion. If lexical cohesion is a useful relation 
to explore for getting at the 'aboutness' of a 
text, and hence for generating summaries, then 
so too may reference (separately, or in conjunc- 
tion with lexical cohesion). Indeed, identifying 
chains of coreferential expressions in text has 
certain strengths over identifying chains of expressions
related merely on lexical semantic
grounds. There is no doubt that common
reference, correctly identified, directly ties dif- 
ferent parts of a text together - they are liter- 
ally 'about' the same thing; lexical semantic re- 
latedness, as indicated by an external resource, 
can never conclusively establish this degree of 
relatedness, nor indeed can the resource guar- 
antee that semantic relatedness will be found 
when it exists. Further, lexical cohesion techniques
ignore pronominal anaphora, and hence
their frequency counts of key terms, used both 
for identifying best chains and best sentences 
within best chains, may often be inaccurate, as 
focal referents will often be pronominalised. 
Of course there are drawbacks to a 
coreference-based approach. Lexical cohesion 
relations are relatively easy to compute and do 
not rely on full text processing - this makes 
summarisation techniques based on them rapid 
and robust. Coreference relations tend to re- 
quire more complex techniques to compute. 
Our view, however, is that summarisation re- 
search is still in early stages and that we need 
to explore many techniques to understand their 
strengths and weaknesses in terms of the type 
and quality of the summaries they produce. 
If coreference-based techniques can yield good 
summaries, this will provide impetus to make 
coreference technologies better and faster. 
The basic coreference chain technique we de- 
scribe in this paper yields generic summar- 
ies as opposed to user-focused summaries, as 
these terms have been used in relation to the 
TIPSTER SUMMAC text summarization eval- 
uation exercise (Mani et al., 1998). That is, the 
summaries aim to satisfy a wide readership by 
supplying information about the 'most important'
entity in the text. But of course this technique
could also be used to generate summaries
tailored to a user (or user group) through use with a
preprocessor that analyzed a user-supplied topic 
description and selected one or more entities 
from the topic description to use in filtering 
coreference chains found in the full source doc- 
ument. 
The rest of this paper is organised as fol- 
lows. In Section 2 we briefly describe the sys- 
tem we use for computing coreference relations. 
Section 3 describes various heuristics we have 
implemented for extracting a 'best' coreference 
chain from the set of coreference chains computed
for a text, and discusses how we select
'best' sentences to include in the summary
from those source text sentences containing re- 
ferring expressions in the 'best' chain. Section 
4 presents a simple example and shows the dif- 
ferent summaries that different heuristics pro- 
duce. Section 5 describes the limited evaluation 
we have been able to carry out to date, but more 
importantly introduces what we believe to be a 
novel and interesting way of reusing some of the 
MUC materials for assessing summaries. 
2 Coreference in the LaSIE system 
The LaSIE system (Gaizauskas et al., 1995) has 
been designed as a general purpose IE system 
which can conform to the MUC task specific- 
ations for named entity identification, corefer- 
ence resolution, IE template element and re- 
lation identification, and the construction of 
scenario-specific IE templates. The system has 
a pipeline architecture which processes a text 
one sentence at a time and consists of three prin- 
cipal processing stages: lexical preprocessing, 
parsing plus semantic interpretation, and dis- 
course interpretation. The overall contributions 
of these stages may be briefly described as fol- 
lows (see (Gaizauskas et al., 1995) for further 
details): 
lexical preprocessing reads and tokenises 
the raw input text, performs phrasal 
matching against lists of proper names, 
identifies sentence boundaries, tags the 
tokens with parts-of-speech, performs mor- 
phological analysis; 
parsing and semantic interpretation 
builds lexical and phrasal chart edges in 
a feature-based formalism, then does two-pass
chart parsing, pass one with a special
named entity grammar, pass two with a 
general grammar, and, after selecting a 
'best parse', which may have only partial 
coverage, constructs a predicate-argument 
representation of each sentence; 
discourse interpretation adds the informa- 
tion from the predicate-argument repres- 
entation to a hierarchically structured se- 
mantic net which encodes the system's 
world and domain model, adds additional 
information presupposed by the input, per- 
forms coreference resolution between new 
and existing instances in the world model, 
and adds any information consequent upon 
the new input. 
The domain model is encoded as a hierarchy 
of domain-relevant concept nodes, each with an 
associated attribute-value structure describing 
properties of the concept. As a text is pro- 
cessed, instances of concepts mentioned in a 
text are added to the domain model, populat- 
ing it to become a text-, or discourse-, specific 
model. 
Coreference resolution is carried out by at- 
tempting to merge each newly added instance 
with instances already present in the discourse 
model. The basic mechanism, detailed in Ga- 
izauskas and Humphreys (1997), is to examine, 
for each pair of newly added and existing instances:
semantic type consistency/similarity
in the concept hierarchy; attribute value consistency/similarity;
and a set of heuristic rules,
some specific to particular types of anaphora 
such as pronouns, which can act to rule out a 
proposed merge. These rules can refer to vari- 
ous lexical, syntactic, semantic, and positional 
information about instances, and have mainly 
been developed through the analysis of train- 
ing data. A recent addition, however, has been 
the integration of a more theoretically motivated
focus-based algorithm for the resolution
of pronominal anaphora (Azzam et al., 1998). 
This includes the maintenance of a set of fo- 
cus registers within the discourse interpreter, 
to model changes of focus through a text and 
provide additional information for the selection 
of antecedents. 
The discourse interpreter maintains an expli- 
cit representation of coreference chains created 
as a result of instances being merged in the dis- 
course model. Each instance has an associated 
attribute recording its position in the original 
text in terms of character positions. When in- 
stances are merged, the result is a single in- 
stance with multiple positions which, taken to- 
gether, represent a coreference chain. 
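The bookkeeping described above can be sketched as follows. This is a minimal illustration only, not the LaSIE implementation: the `Instance` class, the `merge` function and the character offsets are hypothetical and introduced here purely for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A discourse-model instance; `positions` holds (start, end) character offsets."""
    label: str
    positions: list = field(default_factory=list)


def merge(existing: Instance, new: Instance) -> Instance:
    """Merge a newly added instance into an existing one.

    The result keeps the existing label and the sorted union of both
    position lists; taken together these positions stand for the chain.
    """
    combined = sorted(set(existing.positions) | set(new.positions))
    return Instance(existing.label, combined)


# Illustrative offsets only; they are not taken from a real text.
washington = Instance("Consuela Washington", [(0, 19)])
pronoun = Instance("She", [(412, 415)])
chain = merge(washington, pronoun)
print(chain.positions)
```

Keeping the positions sorted means the chain can later be read off in text order, which is convenient when sentences are looked up by character position.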
3 Coreference Chain Selection 
The summarisation mechanism is implemented 
as an additional module in the LaSIE system. 
It processes all the coreference chains built by 
the discourse interpreter and applies various cri- 
teria, described below, to select a 'best' chain. 
The set of sentences in which entries in the best 
chain occur are then identified, using their char- 
acter positions, and concatenated together to 
act as a summary. 
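The extraction step can be sketched in a few lines. This is a hedged illustration under stated assumptions: sentences are assumed to be available as (start, end, text) triples with character offsets, `summarise` is a hypothetical name, and the sample text is invented for the example.

```python
def summarise(sentences, chain_positions):
    """Concatenate, in text order, the sentences containing a chain entry.

    sentences: list of (start, end, text) with character offsets into the source.
    chain_positions: list of (start, end) offsets of the entries in the best chain.
    """
    selected = []
    for s_start, s_end, text in sentences:
        # A sentence is kept if any chain entry falls inside its span.
        if any(s_start <= m_start and m_end <= s_end
               for m_start, m_end in chain_positions):
            selected.append(text)
    return " ".join(selected)


# Invented three-sentence text with hand-computed offsets.
sentences = [
    (0, 30, "Ms. Washington is a candidate."),
    (31, 46, "Banks objected."),
    (47, 64, "She is a counsel."),
]
best_chain = [(0, 14), (47, 50)]  # "Ms. Washington", "She"
print(summarise(sentences, best_chain))
```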
3.1 Selection Criteria 
The summarisation module implements several 
selection criteria which can be applied either in- 
dependently or in combination, as specified in 
parameters of the module. The current set of 
criteria is not fixed, however, and is simply in- 
tended to represent common intuitive heuristics 
as a starting point for further experimentation. 
Length of Chain: This criterion simply prefers
the chain containing the most entries, 
which represents the most frequently men- 
tioned instance in a text, and so possibly 
the most important instance. In the case 
of several chains of equal maximum length, 
the spread is used in addition to select a 
single chain. 
Spread of Chain: This criterion involves the
calculation of the distance, in byte off- 
sets, between the earliest and latest entry 
in each chain. The chain which spans 
the greatest portion of the original text 
is preferred, corresponding to the intuition 
that an instance mentioned throughout a 
text, rather than just being frequently men- 
tioned in one specific section, may be the 
most important instance. If this criterion
fails to select a single chain, the length criterion
will be used in addition.
Start of Chain: A further intuition is that instances
mentioned at the start of a text, or
in a title if present, may be more import- 
ant than those only introduced part way 
through. This criterion acts to require that
the earliest entry in a chain is within either 
the title or the first paragraph of a text. 
Again, other criteria will be needed to se- 
lect a single chain if several start in the first 
paragraph. This criterion may be more appropriate
for particular text genres, such as
newswires, than the previous two. 
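The three criteria and their tie-breaking can be sketched as below. This is an illustrative reading of the descriptions above, not the module's actual code: all function names, the `first_paragraph_end` parameter and the toy offsets are assumptions, and returning `None` when no chain starts early enough is a choice made only for this sketch.

```python
def chain_length(chain):
    """Number of entries (mentions) in the chain."""
    return len(chain)


def chain_spread(chain):
    """Distance in offsets between the earliest and latest entry."""
    return max(end for _, end in chain) - min(start for start, _ in chain)


def select_best(chains, primary="length", first_paragraph_end=None):
    """Pick one chain name from `chains` (name -> list of (start, end) offsets).

    primary: "length" or "spread"; the other criterion breaks ties.
    first_paragraph_end: if given, apply the 'start of chain' requirement
    that the earliest entry lies before this offset.
    """
    names = list(chains)
    if first_paragraph_end is not None:
        names = [n for n in names
                 if min(start for start, _ in chains[n]) < first_paragraph_end]
    if not names:
        return None  # assumption: no chain qualifies, so nothing is selected

    def key(name):
        length, spread = chain_length(chains[name]), chain_spread(chains[name])
        return (length, spread) if primary == "length" else (spread, length)

    return max(names, key=key)


# Toy chains with invented offsets.
chains = {
    "entity_a": [(0, 19), (200, 214), (400, 403), (600, 614)],
    "entity_b": [(150, 157), (900, 907)],
    "entity_c": [(910, 922), (1000, 1010)],
}
print(select_best(chains, primary="length"))
print(select_best(chains, primary="spread"))
print(select_best(chains, primary="spread", first_paragraph_end=100))
```

In this toy data the length and spread criteria disagree, and adding the start-of-chain requirement changes the outcome of the spread criterion, which is the kind of interaction the parameters of the module are meant to expose.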
3.2 Focus Chains 
An additional selection mechanism, which may 
be combined with the above criteria in several 
ways, is the 'reduction' of each coreference chain 
to what may be called a "focus chain". This 
makes use of the focus registers built within 
the discourse interpreter to track changes in the 
main focus of each clause in the text. A focus 
chain is the subset of a coreference chain which 
contains only those mentions of an instance that 
occur as the focus of a clause. For example, 
an indefinite noun phrase occurring as a logical 
subject may be recorded in the focus registers, 
and so be retained in a focus chain, whereas a 
subsequent definite noun phrase embedded in 
an optional prepositional phrase will not be in 
focus and will be removed. 
The criteria listed above can all be applied to 
focus chains in exactly the same way as corefer- 
ence chains. However, the point at which the 
coreference chains are reduced is significant: se- 
lecting the longest coreference chain and sub- 
sequently removing all non-focus entries before 
output may give different results than an initial 
selection of the longest focus chain. The former 
sequence of operations acts to filter the entries 
within the 'best' coreference chain, possibly to 
reduce the final length of a summary, whereas 
the latter selects on the basis of the frequency 
(in combination with the length criterion) or the
position (with spread) of focus itself. Only the 
latter sequence is considered in any detail below, 
although the implementation does allow experi- 
mentation with the use of focus chains at several 
alternative stages. 
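The difference between the two orders of operation can be made concrete with a small sketch. It assumes, purely for illustration, that each chain entry carries a boolean flag saying whether the mention was the focus of its clause; the function names and the toy data are hypothetical.

```python
def focus_chain(chain):
    """Keep only the entries that occur as the focus of a clause."""
    return [entry for entry in chain if entry[1]]


def select_then_reduce(chains):
    """Select the longest coreference chain, then drop its non-focus entries."""
    best = max(chains, key=lambda name: len(chains[name]))
    return best, focus_chain(chains[best])


def reduce_then_select(chains):
    """Reduce every chain to its focus chain, then select the longest one."""
    reduced = {name: focus_chain(chain) for name, chain in chains.items()}
    best = max(reduced, key=lambda name: len(reduced[name]))
    return best, reduced[best]


# Each entry is (character offset, in_focus); the values are invented.
chains = {
    "a": [(0, True), (50, False), (90, False), (140, False)],
    "b": [(10, True), (60, True), (120, True)],
}
print(select_then_reduce(chains))
print(reduce_then_select(chains))
```

Here chain "a" is the longest coreference chain but is rarely in focus, so the first order yields a one-entry extract from "a" while the second selects "b".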
4 Example Output 
This section illustrates the kind of summaries 
produced by the different heuristics, using an 
example from the MUC-6 evaluation corpus of 
newswire articles (much reduced for inclusion 
here by omitting the eight final paragraphs): 
<DOC> 
<HL> Economy: Washington, an Exchange Ally, 
Seems To Be Strong Candidate to Head SEC 
</HL> 
<TXT>
Consuela Washington, a longtime House staffer 
and an expert in securities laws, is a leading candid- 
ate to be chairwoman of the Securities and Exchange 
Commission in the Clinton administration. 
Ms. Washington, 44 years old, would be the first
woman and the first black to head the five-member 
commission that oversees the securities markets. 
Ms. Washington's candidacy is being championed 
by several powerful lawmakers including her boss, 
Chairman John Dingell (D., Mich.) of the House 
Energy and Commerce Committee. She currently is 
a counsel to the committee. Ms. Washington and 
Mr. Dingell have been considered allies of the secur- 
ities exchanges, while banks and futures exchanges 
have often fought with them. 
A graduate of Harvard Law School, Ms. Wash- 
ington worked as a lawyer for the corporate finance 
division of the SEC in the late 1970s. She has been 
a congressional staffer since 1979. 
Separately, Clinton transition officials said
that Frank Newman, 50, vice chairman and chief
financial officer of BankAmerica Corp., is expected
to be nominated as assistant Treasury secretary 
for domestic finance. Mr. Newman, who would be 
giving up a job that pays $1 million a year, would 
oversee the Treasury's auctions of government 
securities as well as banking issues. He would report 
directly to Treasury Secretary-designate Lloyd 
Bentsen. 
(...) 
Using the 'length of chain' criterion selects
the coreference chain for the person Consuela 
Washington, and the following summary is 
produced: 
<HL> Economy: Washington, an Exchange Ally, 
Seems To Be Strong Candidate to Head SEC </HL> 
Consuela Washington, a longtime House staffer 
and an expert in securities laws, is a leading candid- 
ate to be chairwoman of the Securities and Exchange 
Commission in the Clinton administration. 
Ms. Washington, 44 years old, would be the first 
woman and the first black to head the five-member 
commission that oversees the securities markets. 
Ms. Washington's candidacy is being championed 
by several powerful lawmakers including her boss, 
Chairman John Dingell (D., Mich.) of the House 
Energy and Commerce Committee. She currently is 
a counsel to the committee. Ms. Washington and 
Mr. Dingell have been considered allies of the secur- 
ities exchanges, while banks and futures exchanges 
have often fought with them.
A graduate of Harvard Law School, Ms. Wash- 
ington worked as a lawyer for the corporate finance 
division of the SEC in the late 1970s. She has been
a congressional staffer since 1979. 
Using the 'spread of chain' criterion selects the
coreference chain for the person Clinton, used here
as a noun modifier: 
Consuela Washington, a longtime House staffer 
and an expert in securities laws, is a leading candid- 
ate to be chairwoman of the Securities and Exchange 
Commission in the Clinton administration. 
Separately, Clinton transition officials said 
that Frank Newman, 50, vice chairman and chief 
financial officer of BankAmerica Corp., is expected
to be nominated as assistant Treasury secretary for 
domestic finance. 
Restricting the selection to focus chains only, 
the 'length of chain' criterion again chooses
the Ms. Washington chain, but this does not 
include the elements occurring in phrases where 
Ms. Washington is not in focus. The summary 
is therefore reduced by omitting the non-focus 
mentions: 
<HL> Economy: Washington, an Exchange Ally, 
Seems To Be Strong Candidate to Head SEC </HL> 
Consuela Washington, a longtime House staffer 
and an expert in securities laws, is a leading candid- 
ate to be chairwoman of the Securities and Exchange 
Commission in the Clinton administration. 
Ms. Washington, 44 years old, would be the first 
woman and the first black to head the five-member 
commission that oversees the securities markets. 
She currently is a counsel to the committee. 
A graduate of Harvard Law School, Ms. Washington
worked as a lawyer for the corporate finance
division of the SEC in the late 1970s. She has been
a congressional staffer since 1979. 
This same summary is also obtained when the 
'spread of chain' criterion is applied to the focus
chains. The Clinton chain selected above does 
not include any focus elements, and the Con- 
suela Washington focus chain is the most spread 
as well as the longest. 
5 Evaluating Summaries 
Evaluating the merit of a summary is a diffi- 
cult task. The most extensive effort to date 
to develop a framework for assessing summar- 
ies has been the TIPSTER SUMMAC evalu- 
ation exercise (Mani et al., 1998). In this sec- 
tion we first review the SUMMAC evaluation 
measures and propose how we might 'simulate' 
these without expensive human judges; then, we 
describe how we have carried out one of these 
simulated evaluations to evaluate the summar- 
ization technique described in the previous sec- 
tions. 
5.1 The SUMMAC Evaluation 
This exercise involved a number of different 
tasks and a number of different evaluation measures.
The measures were divided into extrinsic
measures - those that ignore the content of the
summary and assess it solely according to how useful
it is in enabling an agent to perform some meas- 
urable task - and intrinsic measures - those 
that examine the content of the summary and 
attempt to pass some judgment on it directly. 
In brief, the four SUMMAC tasks were: 
1. Ad Hoc Task: The summariser is given
a topic description and a set of docu- 
ments and produces user-focused summar- 
ies based on the topic description. The 
summaries and the topic description (along 
with some source documents for control) 
are passed to a judge who reads the summaries
and passes relevance judgments on them
with respect to the topic. These judgments 
are scored against 'true' relevance judg- 
ments previously established for the full 
documents with respect to the topic. This 
is an extrinsic evaluation that measures the 
utility of a summarization system at gen- 
erating user-focused summaries capable of 
supporting a relevance judgment. 
2. Categorization Task: The summariser is
given a set of documents only and generates 
a summary of each. The summaries (again 
along with some source documents for control)
along with five topic descriptions are
given to a judge who reads the summar- 
ies and categorizes them with respect to 
the five topics, or "none of the above", if 
none is deemed appropriate. These cat- 
egorization judgments are scored against 
'true' categorization judgments previously 
established for the full documents with re- 
spect to the topics. This is an extrinsic 
evaluation that measures the utility of a 
summarization system at generating gen- 
eric summaries capable of supporting a cat- 
egorization judgment. 
3. Question-Answering Task: The summariser
is given a set of documents and a small 
set of topic descriptions (three in the ac- 
tual evaluation) and produces 'informat- 
ive', user-focused summaries. A judge is 
first given the topic descriptions and for 
each produces a set of topic questions, 
each capturing an 'obligatory' aspect of the 
topic - something that has to be present 
in a document to make it relevant to the 
topic. The judge then reads the set of rel- 
evant, full documents to be used in the eval- 
uation and for each produces an answer key 
that contains answers to each of the topic 
questions. Finally the judge reads the sum- 
maries and scores each against the answer 
key for the document, deciding for each 
topic question whether the summary sup- 
plies a correct, partially correct, or miss- 
ing answer. This is an intrinsic evaluation 
that measures the utility of a summar- 
ization system at generating 'informative' 
user-focused summaries capable of acting 
as surrogates for the original document for 
an agent charged with answering the ques- 
tions implicit in a topic description. 
4. Acceptability Task: The summariser is given
a set of documents only and generates a 
summary of each. The summaries and 
the full documents from which they were 
generated are given to a judge who reads
the summaries and documents and passes 
a binary 'goodness' judgment on the sum- 
maries, with no prior guidelines specifying 
what constitutes a good summary. This 
is an intrinsic evaluation that was carried 
out merely to see what sort of correlation
might emerge between the extrinsic evalu- 
ations and naive intrinsic evaluation. 
This is a very valuable framework, but for
those without access to the resources to instan- 
tiate the framework, particularly the judging 
functions, it is hard to use it to assess a summar- 
ization system. Thus, we have considered ways 
to automate the judging function and have identified
several which, while by no means perfect,
may at least provide some useful information 
about the value of the summaries produced. 
For the ad hoc task, we propose replacing the 
judge with an IR system that, given the topic description,
assesses both the summaries and full
documents for relevance. These judgments may 
then be compared against the 'true' relevance 
judgments. The IR system's ability to judge 
full documents may be taken as a baseline meas- 
ure, and the (predicted) loss in accuracy of that 
system relative to the baseline when assessing 
the summaries should provide a measure of how 
good the summaries are. This will not be an 
absolute measure, of course, but one which may 
serve to compare different summarization sys- 
tems, or different parameter settings of the same 
system. 
Similarly, for the categorization task, a doc- 
ument categorization system could be bench- 
marked against the full documents, and then, 
given this baseline, run against the output of the 
summarisation system. Again, an assessment of 
summaries relative to the baseline would be ob- 
tained which could provide significant insights, 
without the cost of a human judge. 
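The shape of these two automated evaluations can be sketched as follows. This is a minimal illustration under stated assumptions: `judge` stands in for whatever IR or categorization system replaces the human judge, the keyword judge and the toy documents are invented, and simple accuracy is used as the score only for the sake of the example.

```python
def accuracy(judgments, gold):
    """Fraction of documents whose automatic judgment matches the 'true' one."""
    correct = sum(1 for doc_id, label in gold.items()
                  if judgments.get(doc_id) == label)
    return correct / len(gold)


def relative_loss(judge, gold, full_texts, summaries):
    """Score the judge on full texts (baseline) and on summaries.

    Returns (baseline, score_on_summaries, loss); the loss is only meaningful
    relative to the same judge, not as an absolute measure.
    """
    baseline = accuracy({d: judge(t) for d, t in full_texts.items()}, gold)
    on_summaries = accuracy({d: judge(t) for d, t in summaries.items()}, gold)
    return baseline, on_summaries, baseline - on_summaries


def keyword_judge(text):
    """Stand-in judge: relevant if the text mentions 'securities'."""
    return "securities" in text.lower()


gold = {"d1": True, "d2": False}
full_texts = {"d1": "An expert in securities laws is a candidate.",
              "d2": "The weather was mild."}
summaries = {"d1": "An expert is a candidate.",
             "d2": "The weather was mild."}
print(relative_loss(keyword_judge, gold, full_texts, summaries))
```

The same harness could be reused to compare two summarisers, or two parameter settings of one summariser, by passing different `summaries` against the same baseline.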
Finally, for the question answering task, we 
propose that some of the earlier MUC resources 
could be reused. First, we assume that a MUC 
template defines the obligatory aspects of some 
topic description, and synthesize a narrative 
topic description from the template (a sort of 
'novelisation' of the template). As in the pre- 
ceding case an IE system could be benchmarked 
against a set of texts for which filled answer keys 
exist by scoring its output answer keys using the 
MUC scoring software. It could then be run 
over the summaries produced by a summariser 
which has been given the full texts and the syn- 
thesized topic description. As with the previ- 
ous cases, the difference between the benchmark 
computed on the full texts, and scores produced 
by automatically filling templates from the sum- 
maries should provide a measure of the informa- 
tion loss in the summary, and hence an indirect 
measure of its value. 
5.2 An Initial Evaluation 
We have not to date been able to carry out the 
proposed evaluations in full on our summariza- 
tion system. In particular, since the goal of the 
basic summarization model we have implemen- 
ted is to produce generic summaries, only the 
automated categorization evaluation is really 
appropriate. 
However, since the MUC materials and scor- 
ing software were available to us, we were eager 
to attempt some form of an automated ques- 
tion answering evaluation, as described in the 
last section, using the existing LaSIE system. 
We therefore adopted a crude technique to sim- 
ulate topic processing by the generic summar- 
ization system: the topic description (a TREC- 
style narrative) from the MUC-6 management 
succession IE task definition was prepended,
as the first paragraph, to each text. We then 
ran our summarisation system using the 'start 
of chain' criterion to select only those coreference
chains which included a link between the topic 
description and the text. This scenario is similar
to that of Baldwin and Morton (1998), who also use
coreference between query, or topic description, 
and text to generate a user-focused, 'indicative' 
summary. Our approach, however, differs from 
theirs in that no special purpose mechanism is 
used to relate a query to a text, and also the 
detail of selecting sentences for inclusion in the 
summary is much simpler. Further, our evalu- 
ation is completely different in that they judge 
the summaries in terms of their capacity to sup- 
port a relevancy judgment (i.e. the SUMMAC 
ad hoc task) whereas we evaluate the summaries 
in terms of their capacity to support a question
answering-type task (MUC template filling). 
Our technique was tried for 30 of the MUC-6 
evaluation texts, and summaries produced using 
the length and spread criteria, with and without 
the initial selection of focus chains. The full 
LaSIE MUC-6 IE system was then run with the 
summaries as input, and the resulting extracted 
templates scored. As a baseline, performance 
of the LaSIE system with the full texts as input 
was 45% recall, 64% precision, with 13 of 16 rel- 
evant texts correctly identified, by the produc- 
tion of a filled template, and 1 text proposed 
spuriously. Performance using the summaries 
from each set of criteria was: 
Length of coreference chain:
6% recall 81% precision
4 texts relevant
17.7% of original text word count
Spread of coreference chain:
2% recall 89% precision
1 text relevant
15.6% of original text word count
Length of focus chain:
1% recall 80% precision
1 text relevant
6.1% of original text word count
Spread of focus chain:
1% recall 80% precision
1 text relevant
6.1% of original text word count
The relevance figures alone give some meas- 
ure of the information loss between the full texts 
and the summaries, but the very low recall in 
all cases suggests that the summaries are really 
not suited to this task. One significant reason 
for this is that the MUC-6 topic description re- 
quires the identification of an event involving 
3 instance types - an organisation, a manage- 
ment post and a person - while the production 
of the summaries is based on the selection of a 
single chain representing a single instance. The 
occurrence of all 3 required instances within the 
sentences of a single chain can therefore be ex- 
pected to be rare. A more fruitful use of the 
topic description would be to produce a sum- 
mary from all chains linking the text to the 
topic, rather than a single 'best' chain. Altern- 
atively, for event-based topics, chains of event 
coreference relations could be used. 
Out of interest, the evaluation was also run 
on summaries produced without the criterion requiring
a link between the topic and the text,
i.e. the generic summaries aiming to capture 
the single main topic of each text. The sum- 
maries produced are considerably longer than 
with the topic description, but this gives con- 
siderably higher recall for the question answer- 
ing task. However, given that the summariser 
here is attempting to capture the main topic of 
each text, independently of the question topic, 
these results are really a measure of the extent 
to which the text topics and question topic co- 
incide in the corpus, rather than the suitability 
of the generic summaries for the question an- 
swering task. 
Length of coreference chain:
33% recall 67% precision
10 texts relevant 2 spurious
87.0% of original text word count
Spread of coreference chain:
30% recall 65% precision
10 texts relevant 2 spurious
41.5% of original text word count
Length of focus chain:
19% recall 73% precision
8 texts relevant
19.6% of original text word count
Spread of focus chain:
19% recall 73% precision
8 texts relevant
18.2% of original text word count
A much more detailed analysis of these initial 
results is required before firm conclusions can 
be drawn, but the methodology does represent 
a useful step towards an automated evaluation 
procedure. 
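As a worked illustration of the 'difference from the benchmark' idea, the recall figures reported in this section can be set against the full-text baseline of 45% recall. The sketch below only rearranges numbers already given above; the helper name `recall_loss` and the choice to report both an absolute difference and a retained fraction are assumptions made for the example.

```python
BASELINE_RECALL = 45  # LaSIE on the full texts, in percent

# Recall (percent) of template filling from summaries, as reported above.
topic_linked = {
    "length of coreference chain": 6,
    "spread of coreference chain": 2,
    "length of focus chain": 1,
    "spread of focus chain": 1,
}
generic = {
    "length of coreference chain": 33,
    "spread of coreference chain": 30,
    "length of focus chain": 19,
    "spread of focus chain": 19,
}


def recall_loss(recall, baseline=BASELINE_RECALL):
    """Return (loss in percentage points, fraction of baseline recall retained)."""
    return baseline - recall, recall / baseline


for label, results in (("topic-linked", topic_linked), ("generic", generic)):
    for criterion, recall in results.items():
        loss, retained = recall_loss(recall)
        print(f"{label:12s} {criterion:28s} loss={loss:2d} retained={retained:.2f}")
```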
6 Conclusion 
The use of coreference chains for the production 
of text summaries appears to be a promising 
technique. Further investigation into the most 
appropriate selection criteria for different pur- 
poses is required, and the current implement- 
ation is useful for such experimentation. To 
determine progress, however, automatic eval- 
uation of summaries will be extremely valu- 
able, and we have described initial experiments 
towards this goal using existing tools and re- 
sources. The current results demonstrate there 
may need to be subtle interactions between the 
selection of coreference chains and given topic 
descriptions, such as the use of event chains for 
certain tasks, but the simple selection of a single 
chain may still be useful as the basis for generic
summaries.

References 
S. Azzam, K. Humphreys, and R. Gaizauskas. 1998. 
Evaluating a focus-based approach to anaphora 
resolution. In Proceedings of COLING-ACL'98, 
pages 74-78. 
A. Bagga and B. Baldwin. 1998. Entity-based cross- 
document coreferencing using the vector space 
model. In Proceedings of the COLING-ACL'98 
Joint Conference (the 17th International Conference
on Computational Linguistics and 36th Annual
Meeting of the Association for Computational
Linguistics), Montreal, August.
B. Baldwin and T.S. Morton. 1998. Dynamic 
coreference-based summarization. In Proceedings 
of the Conference on Empirical Methods in Nat- 
ural Language Processing (EMNLP'98). 
R. Barzilay and M. Elhadad. 1997. Using lexical 
chains for text summarization. In Proceedings of 
the ACL Workshop on Intelligent Scalable Text 
Summarization, pages 10-17, Madrid, July. 
DARPA: Defense Advanced Research Projects 
Agency. 1998. Proceedings of the Seventh Message
Understanding Conference (MUC-7). Available
at http://www.muc.saic.com.
R. Gaizauskas and K. Humphreys. 1997. Quantitative
Evaluation of Coreference Algorithms in an
Information Extraction System. Technical report 
CS-97-19, Department of Computer Science,
University of Sheffield. 
R. Gaizauskas, T. Wakao, K. Humphreys, H. Cunningham,
and Y. Wilks. 1995. Description of the
LaSIE system as used for MUC-6. In Proceedings 
of the Sixth Message Understanding Conference 
(MUC-6), pages 207-220. Morgan Kaufmann. 
M.A.K. Halliday and R. Hasan. 1976. Cohesion in 
English. Longman, London. 
I. Mani, D. House, G. Klein, L. Hirschman, L. Obrst, 
T. Firmin, M. Chrzanowski, and B. Sundheim. 
1998. The TIPSTER SUMMAC text summariza- 
tion evaluation: Final report. MITRE Technical 
Report MTR 98W0000138, MITRE. 
J. Morris and G. Hirst. 1994. Lexical cohesion computed
by thesaural relations as an indicator of
the structure of text. Computational Linguistics, 
17(1):21-45. 
K. Sparck Jones. 1993. What might be in sum- 
mary? In Knorz, Krause, and Womser-Hacker, 
editors, Information Retrieval 93: Von der Modellierung
zur Anwendung, pages 9-26. Universitätsverlag
Konstanz.
