Overview of the University of Pennsylvania's 
TIPSTER Project 
University of Pennsylvania 
Breck Baldwin, Institute for Research in Cognitive Science, University of Pennsylvania
Thomas S. Morton, Department of Computer and Information Science, University of Pennsylvania
Amit Bagga, Department of Computer Science, Duke University
{breck, tsmorton, bagga}@unagi.cis.upenn.edu
Introduction 
CAMP software has been used in a variety of areas, 
but at the end of TIPSTER it finishes as it started- 
as a coreference annotation system. The corefer- 
ence output has been used to participate in MUC- 
6 and MUC-7, served as the foundation for three 
types of summarization engines and been input to 
a cross-document coreference system for names and 
events. This document focuses on the most successful
of these applications: a query-sensitive summarization
system and a cross-document coreference
system.
Dynamic Coreference-Based 
Summarization 
We have developed a query-sensitive text summa- 
rization technology well suited for the task of deter- 
mining whether a document is relevant to a query. 
Enough of the document is displayed for the user 
to determine whether the document should be read 
in its entirety. Evaluations indicate that summaries 
are classified for relevance nearly as well as full doc- 
uments. This approach is based on the concept that 
a good summary will represent each of the topics 
in the query and is realized by selecting sentences 
from the document until all the phrases in the query 
which are represented in the summary are 'covered.' 
A phrase in the document is considered to cover a 
phrase in the query if it is coreferent with it. This 
approach maximizes the space of entities retained 
in the summary with minimal redundancy. The 
software is built upon the CAMP NLP system \[3\]. 
Problem Statement 
Given the relative immaturity of summarization 
technologies and their evaluation, it is worthwhile 
to describe our approach in detail and the prob- 
lems it is intended to solve. An important aspect 
of our technique is that we produce sentence extrac- 
tion summaries which are constructed by selecting 
sentences from the source document. In addition, 
our summaries are focused on providing relevant 
information about a query. We feel that the cur- 
rent state-of-the-art techniques are better equipped 
to produce high quality query-sensitive summaries 
than generic summaries. Our goal is to produce 
'indicative' summaries \[5\] which allow a user to de- 
termine whether the document is relevant to his or 
her query. The summary is not intended to replace 
the document or provide answers to questions di- 
rectly but may have this effect. 
Casting our technology in terms of a product, 
we see the application as an intermediate step be- 
tween viewing entire documents and the output of 
an information retrieval engine. Instead of looking 
at either headlines or an entire document, the user 
would look at the summaries of the documents and 
then decide whether the document merited further 
reading. 
Approach 
We conducted a simple experiment with summaries 
produced in the TIPSTER summarization dry run 
\[8\]. For 5 queries with 200 documents each, we 
took the set of summaries produced by the 6 dry- 
run participants and retained only those summaries 
that were true-positives, i.e., the summary was 
judged 'relevant' and the full document was judged 
'relevant'. Over all the queries, at least one of 
the six systems produced a true-positive summary 
for 96.6% of the documents, although no individ- 
ual system performed nearly at that level. This 
meant that some existing technology produced a 
correct summary for almost every relevant docu- 
ment. Hence we viewed the problem as one of bal- 
ancing the capabilities of our system to behave like 
the amalgamated system implicit in joined output. 
Based on this result we are confident that this class 
of summarization is tractable with current tech- 
nologies and this has strongly motivated our design 
decisions. 
Upon encountering a query like "Reporting 
on possibility of and search for extra-terrestrial 
life/intelligence.", we assume that the user has de- 
fined a class of actions, ideas, and/or entities that 
he or she is interested in. The job of an informa- 
tion retrieval engine is to find instantiations of those 
classes in text documents in some database. We 
view summarization as an additional step in this 
process where we attempt to present the user with 
the smallest collection of sentences in the document 
that instantiate the user specified classes and do 
not mislead the user about the overall content of 
the document. By doing so, we can greatly shorten 
the amount of the document that the user must 
read in order to determine whether the document 
is relevant for the user's needs. 
Just as information retrieval algorithms approx- 
imate document relatedness by examining various 
string matchings between the query and the text, 
we approximate certain classes of coreference be- 
tween the query and the text by examining lin- 
guistic information. These coreference relations in- 
clude identity of reference and part-whole relations 
for nominal and verbal phrases (see footnote 1). This moves us a
step closer to reasoning at a more appropriate level 
of generalization, for summarization, which is still 
technologically feasible. Below are examples indi- 
cating the classes of relatedness that we are trying 
to capture. 
The identity relation between the query 
and the document 
Noun phrase coreference is the best understood 
class of relations that we compute. For example, 
there is coreference between 'Federal Emergency 
Management Agency' in the query and the acronym 
'FEMA' in the document below: 
Query: What is the main function of the Fed- 
eral Emergency Management Agency 
and the funding level provided to meet emer- 
gencies? 
Document: ...FEMA agrees that "fine-tuning"
is needed to the 1974 act establishing
a coordinated federal program to prepare for
and respond to hurricanes, tornadoes, storms
and floods ....

Footnote 1: It is not clear whether more sophisticated annotations
are appropriate for information retrieval, and
perhaps more to the point, it is not clear that there are
sufficient resources to process 2 GB collections of data.

Since these noun phrases refer to the same entity in 
the world, sentences that mention the organization 
would be particularly valuable in a summary. This 
class of coreference can include people, companies 
and objects such as automobiles or aluminum sid- 
ing. It need not be restricted to proper nouns as 
it is possible to refer to an entity using common 
nouns, e.g. 'the agency', and pronouns.
Identity also holds between events mentioned in 
the query and document. Sometimes the event 
that a query describes is the best indicator of what 
document should be retrieved, and correspondingly 
what sentences are appropriate for a summary. 
Consider the following: 
Query: A relevant document will provide new 
theories about the 1960's assassination of 
President Kennedy. 
Document: ...The House Assassinations 
Committee concluded in 1978 that Kennedy 
was "probably" assassinated as the result of a 
conspiracy involving a second gunman, a find- 
ing that broke from the Warren Commission's 
belief that Lee Harvey Oswald acted alone in 
Dallas on Nov. 22, 1963 .... 
The noun phrase 'the 1960's assassination' refers 
to an event, which is the same as the one referred 
to in the document with the verb 'assassinated'. 
Note also that there is coreference between 'Presi- 
dent Kennedy' and 'Kennedy' in the document. 
The part-whole relation between the 
query and the document 
In addition to the identity relation, phrases in a 
text which refer to parts of an entity or concept 
mentioned in the query will likely provide useful 
information, and therefore should be included in a 
summary. Finding these relations in general is
beyond the scope of this paper; however, our approximation
of a subclass of these relations proved
helpful for a number of queries. 
A strong example of the part-whole relation oc- 
curs when a country is mentioned in the query and 
a province or city within that country is mentioned 
in the document. For example: 
Query: Document will discuss efforts by the 
black majority in South Africa to over- 
throw domination by the white minority gov- 
ernment. 
Document: About 90 soldiers have been 
arrested and face possible death sentences 
stemming from a coup attempt in Bo- 
phuthatswana, ... Rebel soldiers staged the 
takeover bid Wednesday, detaining homeland 
President Lucas Mangope .... 
Bophuthatswana is inside South Africa, and sen- 
tences that mention it are clearly good candidates 
for inclusion in a summary. 
We also consider part-whole relations between 
events as in the relation between 'overthrow' and 
'staged' and 'detained'. Those events are sub-parts 
of overthrow events, and as such, sentences that 
contain sub-parts of the events are reasonable can- 
didates for inclusion in summaries. 
Implementation 
The summarization technique was developed within 
the CAMP NLP framework. This system provides 
an integrated environment in which to access many 
levels of linguistic information as well as world 
knowledge. Its main components include: named 
entity recognition, tokenization, sentence detec- 
tion, part-of-speech tagging, morphological analy- 
sis, parsing, argument detection, and coreference 
resolution. Many of the techniques used for these 
tasks perform at or near the state of the art and are 
described in more depth in \[16, 12, 11, 9, 6, 2, 3\]. 
The system produces coreference annotated docu- 
ments which serve as the input to the summariza- 
tion algorithm. 
Relating the query to the document 
The relationships discussed previously are approx- 
imated via a series of associations between tokens 
in the query, headline, and the body of the docu- 
ment. Event references are captured by associating 
verbs or nominalizations in the query with verbs 
and nominalizations in the document. 
Given three verbal forms $v_1$ in the query, $v_2$ in
the document, and $v_3$ in the set of all verbal forms,
where a verbal form is the morphological root of a
verb or the verb root corresponding to a nominalization,
$v_1$ is associated with $v_2$ if at least one of
the following criteria is met:
1. $(v_1 \neq v_2) \land \frac{p(v_1, v_2)}{p(v_1)\,p(v_2)} > 5$
2. $(v_1 = v_2) \land \left(\exists v_3 \neq v_1 \mid \frac{p(v_1, v_3)}{p(v_1)\,p(v_3)} > 5\right)$
3. $(v_1 = v_2) \land \left((\mathrm{subject}(v_1) = \mathrm{subject}(v_2)) \lor (\mathrm{object}(v_1) = \mathrm{object}(v_2))\right)$
Here p(vi) is the probability that vi occurs in a doc- 
ument and p(vi, vj) is the probability that vi and 
vj occur in the same document. These probabili- 
ties are based on frequencies gathered from approx- 
imately 45,000 Wall Street Journal articles. Crite- 
rion 1 is a measure of mutual information between 
two verbs. Criterion 2 is used to rule out frequently 
occurring verbs such as "be" and "make". Crite- 
rion 3 allows for verbs which are ruled out by cri- 
terion 2 to be associated when additional context 
is available. This is important since some queries 
only contain verbal forms which are ruled out by 
criterion 2. 
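The three criteria can be illustrated with a minimal Python sketch. The names `VerbalForm`, `cooccurrence_ratio` and `associated`, the dictionary-based probability tables, and the toy probabilities are assumptions made for illustration and are not taken from the CAMP implementation; the threshold of 5 is read from the criteria above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional

THRESHOLD = 5.0  # cut-off on p(a, b) / (p(a) p(b)) in criteria 1 and 2


@dataclass(frozen=True)
class VerbalForm:
    root: str                      # morphological root, e.g. "overthrow"
    subject: Optional[str] = None  # subject head, if argument detection found one
    obj: Optional[str] = None      # object head, if argument detection found one


def cooccurrence_ratio(a: str, b: str, p: Dict[str, float],
                       p_joint: Dict[FrozenSet[str], float]) -> float:
    """Return p(a, b) / (p(a) p(b)), or 0.0 when the ratio is undefined."""
    denominator = p.get(a, 0.0) * p.get(b, 0.0)
    if denominator == 0.0:
        return 0.0
    return p_joint.get(frozenset((a, b)), 0.0) / denominator


def associated(v1: VerbalForm, v2: VerbalForm, p: Dict[str, float],
               p_joint: Dict[FrozenSet[str], float],
               vocabulary: Iterable[str]) -> bool:
    if v1.root != v2.root:
        # Criterion 1: different roots that co-occur far more than chance.
        return cooccurrence_ratio(v1.root, v2.root, p, p_joint) > THRESHOLD
    # Criterion 2: same root, and the root is strongly tied to some other
    # verbal form (this filters out very frequent verbs such as "be").
    if any(v3 != v1.root
           and cooccurrence_ratio(v1.root, v3, p, p_joint) > THRESHOLD
           for v3 in vocabulary):
        return True
    # Criterion 3: same root rescued by a shared subject or object.
    same_subject = v1.subject is not None and v1.subject == v2.subject
    same_object = v1.obj is not None and v1.obj == v2.obj
    return same_subject or same_object


# Toy document-level probabilities (invented for this example).
P = {"overthrow": 0.01, "detain": 0.02, "be": 0.9, "make": 0.5}
P_JOINT = {
    frozenset(("overthrow", "detain")): 0.002,  # ratio of about 10
    frozenset(("be", "make")): 0.45,            # ratio of about 1
}
VOCAB = list(P)

print(associated(VerbalForm("overthrow"), VerbalForm("detain"), P, P_JOINT, VOCAB))
print(associated(VerbalForm("be"), VerbalForm("be"), P, P_JOINT, VOCAB))
```

In this toy setting "be" matches "be" only when the two occurrences share a subject or object, which mirrors the role the text gives to criterion 3.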
Relationships between proper nouns are made on 
the basis of string matches, acronym matching, and 
dictionary lookup. Acronyms are determined either 
through a table lookup or an appositive construc- 
tion occurring in the document which designates 
the acronym for a specific proper noun. A proper 
noun in the query is considered associated with 
a proper noun in the document if it matches the 
string or acronym of the proper noun in the docu- 
ment or it appears in the definition of the proper 
noun in the document. A reverse dictionary lookup 
often allows cities to be associated with the country 
they are in. 
A token in the query which is a lowercase noun or 
adjective is associated with any token in the doc- 
ument which matches its morphological root and 
part of speech. 
Tokens which occur in the headline are associ- 
ated with tokens in the document body using the 
same criteria as the query, with the exclusion of 
the dictionary lookup. The dictionary lookup was 
excluded because the headline will likely use the 
same lexicalization of a proper noun as that used 
in a document. This is less likely to be the case 
with the query. 
Selecting a sentence 
The associations discussed in the previous section 
are used to rank and select sentences from the doc- 
ument. Every token in the document which is asso- 
ciated with the same token in the query or headline 
is considered to be in the same coreference chain. A 
sentence which contains any token in a given coref- 
erence chain is said to cover that chain. 
The following scores are computed for each sen- 
tence in the document: 
1. The number of coreference chains from the query 
which are covered by the sentence and haven't 
been covered by a previously selected sentence. 
2. The number of noun coreference chains from the 
query which are covered by the sentence and the 
number of verbal terms in the sentence which are 
chained to the query. 
3. The number of coreference chains from the head- 
line which are covered by the sentence and 
haven't been covered by a previously selected 
sentence. 
4. The number of noun coreference chains from the 
headline which are covered by the sentence and 
the number of verbal terms in the sentence which 
are chained to the headline. 
5. The number of coreference chains which are cov- 
ered by the sentence and haven't been covered by 
a previously selected sentence. 
6. The number of noun coreference chains which are 
covered by the sentence. 
7. The index of the sentence in the document; sen- 
tences are sequentially numbered. 
The sentences are sorted based on the above 
scores, where the ith scoring criterion is only considered
in case of a tie for all criteria less than i. Scores
1-6 are ranked in descending order while score 7 is 
ranked in ascending order. The top-ranked sen- 
tence is selected, and scores 1, 3, and 5 are recom- 
puted in order to select the next sentence. Selection 
halts when all coreference chains in the query have 
been covered and the summary contains at least 4 
sentences. 
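The ranking and selection loop described above can be sketched in Python as follows. This is a simplified illustration, not the CAMP code: the `Sentence` record, its field names, and the toy data are hypothetical, scores 2, 4 and 6 are assumed to be precomputed per sentence, and the halting test assumes that only query chains with at least one mention in the document need to be covered.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Sentence:
    index: int                                   # score 7: position in the document
    query_chains: FrozenSet[str] = frozenset()   # query chains this sentence covers
    headline_chains: FrozenSet[str] = frozenset()
    all_chains: FrozenSet[str] = frozenset()
    score2: int = 0  # static scores, assumed to be computed elsewhere
    score4: int = 0
    score6: int = 0


def select_sentences(sentences: List[Sentence], min_sentences: int = 4) -> List[Sentence]:
    """Greedy selection; returns sentences in the order they were picked."""
    target = set().union(*(s.query_chains for s in sentences))
    covered_q, covered_h, covered_any = set(), set(), set()
    remaining = list(sentences)
    chosen: List[Sentence] = []
    while remaining and (not target <= covered_q or len(chosen) < min_sentences):
        def key(s: Sentence):
            # Negated values give descending order for scores 1-6;
            # the index breaks remaining ties in ascending order.
            return (
                -len(s.query_chains - covered_q),     # score 1
                -s.score2,                            # score 2
                -len(s.headline_chains - covered_h),  # score 3
                -s.score4,                            # score 4
                -len(s.all_chains - covered_any),     # score 5
                -s.score6,                            # score 6
                s.index,                              # score 7
            )
        best = min(remaining, key=key)
        remaining.remove(best)
        chosen.append(best)
        # Scores 1, 3 and 5 change once these chains count as covered.
        covered_q |= best.query_chains
        covered_h |= best.headline_chains
        covered_any |= best.all_chains
    return chosen


demo = [
    Sentence(0, headline_chains=frozenset({"h1"}), all_chains=frozenset({"h1"}),
             score4=1, score6=1),
    Sentence(1, query_chains=frozenset({"q1"}), all_chains=frozenset({"q1"}),
             score2=1, score6=1),
    Sentence(2, query_chains=frozenset({"q1", "q2"}),
             all_chains=frozenset({"q1", "q2"}), score2=2, score6=2),
    Sentence(3),
    Sentence(4, query_chains=frozenset({"q2"}), all_chains=frozenset({"q2"}),
             score2=1, score6=1),
    Sentence(5),
]
picked = select_sentences(demo)
print([s.index for s in picked])
```

Sorting `picked` by `index` afterwards gives the presentation order, since selected sentences are shown in document order.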
Scores 1 and 2 are used to select sentences which 
are related to the query. Scores 3 and 4 are mo- 
tivated by documents which have 1 or 2 sentences 
which appear related to the query but if presented 
alone would give a false impression of the true con- 
tent of the document. Thus sentences related to the 
headline are presented to provide additional back- 
ground. Consider the following example: 
Query: What evidence is there of paramilitary 
activity in the U.S.? 
Summary: ... Last month the extremists used 
rocket-propelled grenades for the first time in 
three attacks on police and paramilitary units. 
This sentence was selected because it contains to- 
kens which are in coreference chains with tokens 
in the query; however, alone it is potentially mis- 
leading because the place of the attack is not men- 
tioned. This ambiguity is resolved when the follow- 
ing sentence is selected because it is well associated 
with the headline. 
Summary: ...Sikh militants may have ac- 
quired one or two U.S.-made Stinger anti- 
aircraft missiles and hidden them inside the 
Golden Temple, the Sikh faith's holiest shrine, 
Punjab police officials said Saturday .... 
This provides enough background information for 
the reader to realize that the para-military activity 
is not taking place in the U.S. and thus that the 
document is irrelevant to the query. 
Likewise, scores 5 and 6 act similarly to 3 and 
4 for documents which do not contain a headline. 
We found this particularly important for advertise- 
ments which often don't state a product or com- 
pany name in the beginning of the document, but 
will repeat these names numerous times throughout 
the document. 
Generating the summary 
Once sentences have been selected, they are pre- 
sented in the order they occurred in the document. 
Pronouns which do not have a referent in the pre- 
vious sentence of the summary are filled with a 
more descriptive string whenever a referent can be 
determined. If space is of concern, prepositional 
phrases attached to nouns (which are not nominal- 
izations), appositives, conjoined noun phrases and 
relative clauses are removed, provided they contain 
no tokens associated with the query or the head- 
line. Since determining pronoun referents and the 
selection of clauses for removal are subject to er- 
rors, filled pronouns are placed in square brackets 
and removed clauses are replaced with an ellipsis 
to indicate to the reader that the original text has 
been modified. 
Example summary 
An example summary which demonstrates many of 
the features of our system appears below. It has 
been constrained to be approximately 10% of the 
original document length, so it is not representa- 
tive of the summaries used in the evaluation, but 
it contains examples of both pronoun filling
and clause deletion. 
The last sentence in the summary was selected 
first because the tokens "death", "sentence", "kill", 
and "term" were associated with the nominaliza- 
tion "punishment". The stranded pronoun "it" has 
also been filled. Sentence 2 was selected next be- 
cause of the match-up between the verb "is" and 
the object "deterrent" in the document and the 
query. Finally, the first sentence was chosen be- 
cause there is another mention of the prison name 
"Marion" in the document. This summary differs 
from the one generated when the 10% length con- 
straint is not imposed, because some higher ranked 
sentences were passed over since their inclusion 
would have exceeded the length restriction. 
Query: Is there data available to suggest that 
capital punishment is a deterrent to crime? 
Summary: "Marion is basically the end of the 
line," Bogdan said. 
... There is no deterrent ... to keep them from 
doing this again. 
Additionally, \[the pending Senate bill\] would 
create five new death penalty offenses: mur- 
der by a federal inmate serving a life sentence; 
drug kingpins in a continuing criminal enter- 
prise even if no murders occur; drug kingpins 
who try to kill to obstruct justice; drug felons 
who unintentionally kill with aggravated reck- 
lessness; and people who kill with a firearm 
during a violent ... crime. 
Evaluation 
In order to evaluate our summarization algorithm, 
we selected 10 unseen queries from the Text RE- 
trieval Conference (TREC) document collection. 
Summaries were generated for 200 documents, 20 
per query, and assessors (see footnote 2) were asked to make relevance
judgments based on the summaries. A document
was considered relevant if it contained the
information requested in the query or if the as- 
sessor believed that the full document would likely 
contain this information. The relevance judgments 
were then compared to those made by the TREC 
assessors using the full document. This comparison 
places a summary in one of the following categories: 
• a = judged relevant, full document is relevant 
• b = judged relevant, full document is irrelevant 
• c = judged irrelevant, full document is relevant 
• d = judged irrelevant, full document is irrelevant 
Precision, recall, and accuracy are then computed 
as follows: 
precision = a/(a+b) 
recall = a/(a+c) 
accuracy = (a+d)/(a+b+c+d) 
Compression is computed over the number of
non-whitespace characters in the summary and the
original document. Here compression is defined as
the percentage of the document that was not included
in the summary:
compression = (length_document - length_summary) / length_document

Footnote 2: Each author served as an assessor making judgments
for 100 documents across 10 queries.

The results from our experiment are shown in the 
following table: 
Measure      Value   Computation
Precision    82.8%   101/(101+21)
Recall       77.7%   101/(101+29)
Compression  82.8%   (704686-121272)/704686
Accuracy     75.0%   (101+49)/200
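As a worked check, the figures of this first evaluation can be recomputed from the four category counts and the character totals. The short Python sketch below uses the definitions given above; the function names are ours, and the counts a=101, b=21, c=29, d=49 are read off the computations in the table.

```python
def relevance_metrics(a: int, b: int, c: int, d: int) -> dict:
    """Precision, recall and accuracy from the four judgment categories."""
    return {
        "precision": a / (a + b),
        "recall": a / (a + c),
        "accuracy": (a + d) / (a + b + c + d),
    }


def compression(document_chars: int, summary_chars: int) -> float:
    """Fraction of the document's non-whitespace characters left out of the summary."""
    return (document_chars - summary_chars) / document_chars


first = relevance_metrics(a=101, b=21, c=29, d=49)
first["compression"] = compression(704686, 121272)

for name, value in first.items():
    print(f"{name}: {value:.1%}")
```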
A second evaluation on 910 documents was per- 
formed for \[5\]. These results superficially appear 
significantly worse than those from the initial evaluation;
however, a more careful analysis (provided in
the discussion section) shows that they are in fact 
similar to the results of the previous evaluation. 
Measure      Value   Computation
Precision    80.3%   322/(322+79)
Recall       57.6%   322/(322+237)
Compression  83.0%
Accuracy     65.3%   (322+272)/910
Discussion 
We view the results of the first evaluation as 
promising in that they compare favorably with 
inter-assessor consistency using the entire docu- 
ment. \[15\] reports unanimous relevance judgments 
by three assessors for 71.7% of the documents. In- 
terpolating this figure to two assessors yields an 
80.1% agreement figure. Using summaries which 
on average are only 17.2% of the original docu- 
ment, our assessors matched the TREC assessors 
for 75.0% of the documents. 
The second evaluation yielded a much lower re- 
call figure while precision remained comparable. 
This, however, is also the case when the same assessors'
judgments on the full documents are compared
to those of the TREC assessors. These results are 
as follows: 
Measure      Value   Computation
Precision    83.5%   167/(167+33)
Recall       63.5%   167/(167+96)
Compression  100.0%
Accuracy     69.3%   (167+124)/420
We view these results as favorable as well since our 
accuracy is 65.3% using 17.0% of the document on 
average compared to 69.3% accuracy using the en- 
tire document. The discrepancy between the two 
evaluations appears to be based on the assessors in 
the second evaluation using stricter criteria for
relevance than those used by the previous evaluation's
assessors or the TREC assessors.
It was noted after the first evaluation that dif- 
ferent criteria for relevance accounted for some of 
the disagreement between our assessors and the 
TREC assessors. Many documents considered rele- 
vant were marked as irrelevant due to different no- 
tions of relevance and not because the summary 
failed to provide material on which to base a correct 
decision. These difficulties only hinder the evalua- 
tion of a summary system and not its use in an ap- 
plication, since a user will have a clear idea of his 
or her intentions when determining a document's 
relevance. 
As we mentioned previously, our approach has 
been to balance methods of relating the query to 
sentences in the document. The nearly 100% recall 
of the dry-run summaries encouraged us, and we 
even used the output of those summaries to pro- 
vide a test-bed for evaluating our summaries. Al- 
though we never actively sought to emulate aspects 
of other systems directly, our final algorithm does 
share some basic ideas and approaches from those 
systems. Some of the similarities are listed below: 
In \[4\], they eliminate redundant information from 
summaries by classifying sentences according to 
Maximal Marginal Relevance (MMR). MMR ranks 
text chunks according to their dissimilarity to one 
another. Summaries can then be produced with 
sentences that are maximally dissimilar, thereby 
increasing the likelihood that distinguishing infor- 
mation will be in the summary. One can view our 
coverage requirement for terms in the query as an 
attempt to pick dissimilar sentences from the doc- 
ument. Instead of MMR, we use the fact that a 
sentence which does not contain redundantly re- 
ferring phrases to the query is more highly ranked 
than a sentence that does. 
Our individual sentence scoring algorithm shares 
some properties with \[14\]. Their approach includes 
scores for anaphoric density, string equivalence with 
the title or headline of a document, and position 
of the sentence in the document. However, we do 
not take advantage of overt cues for summary sen- 
tences, such as 'in summary' or 'in conclusion', nor 
do we use temporal information in generating a 
summary. 
Like many systems, we do a form of word ex- 
pansion in attempting to relate the query to the 
document. However, the fact that we restrict ex- 
pansion to proper nouns and verbs and their nom- 
inalizations is notable. We found this limited set 
of expansions restricts the relations between the 
text and the query well and also fits within the 
framework of part-whole relations in coreference. 
We did not consider part-whole relations for com- 
mon nouns, because in practice we have not had 
very good results limiting over-generation in that 
domain. 
In the next section we discuss a novel technology 
for cross document coreference. Like the summa- 
rization system just discussed, it takes within doc- 
ument coreference annotated text, produces sum- 
maries in a very similar form to the above, and 
individuates entities based on the similarity of the 
summaries produced. 
Cross-document Coreference 
Cross-document coreference occurs when the same 
person, place, event, or concept is discussed in more 
than one text source. Computer recognition of this 
phenomenon is important because it helps break 
"the document boundary" by allowing a user to 
examine information about a particular entity from 
multiple text sources at the same time. In partic- 
ular, resolving cross-document coreferences allows 
a user to identify trends and dependencies across 
documents. Cross-document coreference can also 
be used as the central tool for producing summaries 
from multiple documents, and for information fu- 
sion, both of which have been identified as advanced 
areas of research by the TIPSTER Phase III pro- 
gram. Cross-document coreference was also iden- 
tified as one of the potential tasks for the Sixth 
Message Understanding Conference (MUC-6) but 
was not included as a formal task because it was 
considered too ambitious \[10\]. 
In this paper we describe a highly success- 
ful cross-document coreference resolution algorithm 
which uses the Vector Space Model to resolve am- 
biguities between people having the same name. In 
addition, we also describe a scoring algorithm for 
evaluating the cross-document coreference chains 
produced by our system and we compare our algo- 
rithm to the scoring algorithm used in the MUC-6 
(within document) coreference task. 
Cross-Document Coreference: The 
Problem 
Cross-document coreference is a distinct technol- 
ogy from Named Entity recognizers like IsoQuest's 
NetOwl and IBM's Textract because it attempts 
to determine whether name matches are actually 
the same individual (not all John Smiths are the 
same). Neither NetOwl nor Textract has mechanisms
which try to keep same-named individuals
distinct if they are different people. 
Cross-document coreference also differs in sub- 
stantial ways from within-document coreference. 
Within a document there is a certain amount of 
consistency which cannot be expected across docu- 
ments. In addition, the problems encountered dur- 
ing within document coreference are compounded 
when looking for coreferences across documents be- 
cause the underlying principles of linguistics and 
discourse context no longer apply across docu- 
ments. Because the underlying assumptions in 
cross-document coreference are so distinct, they re- 
quire novel approaches. 
Architecture and the Methodology 
Figure 1 shows the architecture of the cross- 
document system developed. The system is built 
upon the University of Pennsylvania's within doc- 
ument coreference system, CAMP, which partici- 
pated in the Seventh Message Understanding Con- 
ference (MUC-7) within document coreference task. 
Our system takes as input the coreference pro- 
cessed documents output by CAMP. It then passes 
these documents through the SentenceExtractor 
module which extracts, for each document, all the 
sentences relevant to a particular entity of inter- 
est. The VSM-Disambiguate module then uses a 
vector space model algorithm to compute similari- 
ties between the sentences extracted for each pair 
of documents. 
Details about each of the main steps of the cross- 
document coreference algorithm are given below. 
• First, for each article, CAMP is run on the ar- 
ticle. It produces coreference chains for all the 
entities mentioned in the article. For example, 
consider the two extracts in Figures 2 and 4. The 
coreference chains output by CAMP for the two 
extracts are shown in Figures 3 and 5. 
• Next, for the coreference chain of interest within 
each article (for example, the coreference chain 
that contains "John Perry"), the SentenceExtractor
module extracts all the sentences that
contain the noun phrases which form the coref- 
erence chain. In other words, the SentenceEx- 
tractor module produces a "summary" of the ar- 
ticle with respect to the entity of interest. These 
summaries are a special case of the query sen- 
sitive techniques being developed at Penn using 
CAMP. Therefore, for doc.36 (Figure 2), since 
at least one of the three noun phrases ("John 
Perry," "he," and "Perry") in the coreference 
chain of interest appears in each of the three sen- 
tences in the extract, the summary produced by 
SentenceExtractor is the extract itself. On the
other hand, the summary produced by SentenceExtractor
for the coreference chain of interest in
doc.38 is only the first sentence of the extract because
the only element of the coreference chain
appears in this sentence.
• For each article, the VSM-Disambiguate module
uses the summary extracted by the SentenceExtractor
and computes its similarity with the
summaries extracted from each of the other articles.
Summaries having similarity above a certain
threshold are considered to be regarding the
same entity.

John Perry, of Weston Golf Club, announced
his resignation yesterday. He was the
President of the Massachusetts Golf Association.
During his two years in office, Perry
guided the MGA into a closer relationship
with the Women's Golf Association of Massachusetts.
Figure 2: Extract from doc.36

Figure 3: Coreference Chains for doc.36
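The SentenceExtractor step described in the list above can be sketched in a few lines of Python. The `Mention` record and `extract_entity_summary` function are hypothetical names, and the chains below are hand-written stand-ins for CAMP output, built from the descriptions of doc.36 and doc.38 in the text.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    sentence_index: int  # sentence in which the noun phrase occurs
    text: str            # surface form of the noun phrase


def extract_entity_summary(sentences: List[str], chain: List[Mention]) -> List[str]:
    """Keep, in document order, every sentence holding a mention from the chain."""
    wanted = {m.sentence_index for m in chain}
    return [s for i, s in enumerate(sentences) if i in wanted]


doc36 = [
    "John Perry, of Weston Golf Club, announced his resignation yesterday.",
    "He was the President of the Massachusetts Golf Association.",
    "During his two years in office, Perry guided the MGA into a closer "
    "relationship with the Women's Golf Association of Massachusetts.",
]
chain36 = [Mention(0, "John Perry"), Mention(1, "He"), Mention(2, "Perry")]

doc38 = [
    'Oliver "Biff" Kelly of Weymouth succeeds John Perry as president of the '
    "Massachusetts Golf Association.",
    '"We will have continued growth in the future," said Kelly, who will '
    "serve for two years.",
    '"There\'s been a lot of changes and there will be continued changes as '
    'we head into the year 2000."',
]
chain38 = [Mention(0, "John Perry")]

print(extract_entity_summary(doc36, chain36))  # the whole extract
print(extract_entity_summary(doc38, chain38))  # only the first sentence
```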
University of Pennsylvania's CAMP 
System 
The University of Pennsylvania's CAMP system 
resolves within document coreferences for several 
different classes including pronouns and proper
names \[7\]. It ranked among the top systems in the 
coreference task during the MUC-6 and the MUC-7 
evaluations. 
The coreference chains output by CAMP enable 
us to gather all the information about the entity of 
interest in an article. This information about the 
entity is gathered by the SentenceExtractor module 
and is used by the VSM-Disambiguate module for 
disambiguation purposes. Consider the extract for 
doc.36 shown in Figure 2. We are able to include 
the fact that the John Perry mentioned in this ar- 
ticle was the president of the Massachusetts Golf 
Association only because CAMP recognized that 
the "he" in the second sentence is coreferent with 
"John Perry" in the first. And it is this fact which 
actually helps VSM-Disambiguate decide that the 
two John Perrys in doc.36 and doc.38 are the same
person.

Figure 1: Architecture of the Cross-Document Coreference System
(labels in the figure: Coreference Chains for doc.01, Coreference Chains for doc.02,
SentenceExtractor, summary.nn, VSM-Disambiguate, Cross-Document Coreference Chains)

Oliver "Biff" Kelly of Weymouth succeeds
John Perry as president of the Massachusetts
Golf Association. "We will have continued
growth in the future," said Kelly, who will
serve for two years. "There's been a lot of
changes and there will be continued changes
as we head into the year 2000."
Figure 4: Extract from doc.38

Figure 5: Coreference Chains for doc.38

The Vector Space Model 
The vector space model used for disambiguating 
entities across documents is the standard vector 
space model used widely in information retrieval 
\[13\]. In this model, each summary extracted by 
the SentenceExtractor module is stored as a vector 
of terms. The terms in the vector are in their mor- 
phological root form and are filtered for stop-words 
(words that have no information content like a, the, 
of, an, ... ). If $1 and $2 are the vectors for the two 
summaries extracted from documents D1 and D2, 
then their similarity is computed as: 
Sire(S1, S2) = ~ w~ x w~j 
common terms tj 
where tj is a term present in both $1 and $2, Wlj 
is the weight of the term tj in S~ and w2j is the 
weight of tj in $2. 
The weight of a term $t_j$ in the vector $S_i$ for a
summary is given by:
$$w_{ij} = \frac{tf \times \log \frac{N}{df}}{\sqrt{s_{i1}^2 + s_{i2}^2 + \cdots + s_{in}^2}}$$
where $tf$ is the frequency of the term $t_j$ in the
summary, $N$ is the total number of documents in the
collection being examined, and $df$ is the number of
documents in the collection that the term $t_j$ occurs
in. $\sqrt{s_{i1}^2 + s_{i2}^2 + \cdots + s_{in}^2}$ is the cosine
normalization factor and is equal to the Euclidean length of
the vector $S_i$.
For each summary $S_i$, the VSM-Disambiguate module
computes the similarity of that summary
with each of the other summaries. If the
similarity computed is above a pre-defined threshold, then
the entities of interest in the two summaries are
considered to be coreferent.
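The weighting, similarity, and threshold steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VSM-Disambiguate implementation: the function names, the toy summaries (including the hypothetical doc.99), and the threshold of 0.3 are invented for the example; terms are assumed to be already stemmed and stop-word filtered; document frequencies are counted over the summaries; and the $s_{ij}$ in the normalization factor are taken to be the unnormalized $tf \times \log(N/df)$ weights. The natural logarithm is used, but because of the cosine normalization the base of the logarithm does not change the result.

```python
import math
from collections import Counter


def document_frequencies(summaries):
    """Count, for each term, how many summaries contain it (df)."""
    df = Counter()
    for terms in summaries.values():
        df.update(set(terms))
    return df


def weight_vector(terms, df, n_docs):
    """Cosine-normalized tf * log(N / df) weights for one summary."""
    tf = Counter(terms)
    raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0.0:
        return {}  # every term occurs in every summary: no usable evidence
    return {t: w / length for t, w in raw.items()}


def similarity(v1, v2):
    """Sum of weight products over the terms common to both vectors."""
    return sum(w * v2[t] for t, w in v1.items() if t in v2)


def coreferent_pairs(summaries, threshold):
    """Pairs of summaries whose similarity is above the threshold."""
    df = document_frequencies(summaries)
    n_docs = len(summaries)
    vectors = {name: weight_vector(terms, df, n_docs)
               for name, terms in summaries.items()}
    names = sorted(vectors)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if similarity(vectors[a], vectors[b]) > threshold]


# Toy summaries (hypothetical, already stemmed and stop-word filtered).
summaries = {
    "doc.36": ["john", "perry", "president", "massachusetts", "golf",
               "association"],
    "doc.38": ["kelly", "succeed", "john", "perry", "president",
               "massachusetts", "golf", "association"],
    "doc.99": ["john", "perry", "coach", "track"],
}
pairs = coreferent_pairs(summaries, threshold=0.3)
print(pairs)
```

In this toy collection the name itself occurs in every summary, so its weight is zero and the decision rests entirely on the surrounding context terms, which is the behaviour one would want when all the documents mention the same ambiguous name.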
Experiments 
The cross-document coreference system was tested
on a highly ambiguous test set which consisted of
197 articles from the 1996 and 1997 editions of the
New York Times. The sole criterion for including
an article in the test set was the presence of a
string in the article which matched
the /John.*?Smith/ regular expression. In other
words, all of the articles either contained the name
John Smith or contained some variation with a
middle initial or name. The system did not use any New
York Times data for training purposes. The
answer keys for the cross-document chains were
manually created, but the scoring was completely
automated.
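As a small illustration of the selection criterion, the sketch below applies the same pattern with Python's `re` module. The helper name `mentions_john_smith` and the sample strings are made up for the example, and the source does not say which regular expression engine or matching options were actually used.

```python
import re

# The pattern quoted in the text, without the surrounding slashes.
JOHN_SMITH = re.compile(r"John.*?Smith")


def mentions_john_smith(text):
    """True if the text contains a string matching /John.*?Smith/."""
    return JOHN_SMITH.search(text) is not None


samples = [
    "John Smith was named chairman.",
    "John F. Smith Jr. addressed shareholders.",
    "Jane Smith won the race.",
    "John Doe met Adam Smith.",
]
results = [mentions_john_smith(s) for s in samples]
print(results)
```

The last sample shows that the pattern is permissive: the non-greedy `.*?` still spans any intervening text on the same line, so a match alone does not guarantee that a person named John Smith is mentioned.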
Analysis of the Data 
There were 35 different John Smiths mentioned in
the articles. Of these, 24 were each mentioned in
only one article. The other 173 articles
concerned the 11 remaining John Smiths. The
backgrounds of these John Smiths, and the number
of articles pertaining to each, varied greatly.
Descriptions of a few of the John Smiths are:
Chairman and CEO of General Motors, assistant
track coach at UCLA, the legendary explorer,
the main character in Disney's Pocahontas, and former
president of the Labor Party of Britain.
Results 
Figure 6 shows the precision, recall, and F-Measure 
(with equal weights for both precision and recall) 
using the B-CUBED scoring algorithm. The Vec- 
tor Space Model in this case constructed the space 
of terms only from the summaries extracted by 
SentenceExtractor. In comparison, Figure 7 shows 
the results (using the B-CUBED scoring algorithm) 
when the vector space model constructed the space 
of terms from the articles input to the system (it 
still used the summaries when computing the simi- 
larity). The importance of using CAMP to extract 
summaries is verified by comparing the highest F- 
Measures achieved by the system for the two cases. 
The highest F-Measure for the former case is 84.6% 
while the highest F-Measure for the latter case is 
78.0%. In comparison, for this task, named-entity 
tools like NetOwl and Textract would mark all the
John Smiths as the same person. Their performance using
our scoring algorithm is 23% precision and 100%
recall.
[Plot: Precision/Recall vs Threshold; x-axis: Threshold (0.1 to 0.9); y-axis: 0 to 100; curves: Our Alg: Precision, Our Alg: Recall, Our Alg: F-Measure]
Figure 6: Precision, Recall, and F-Measure Using
the B-CUBED Algorithm With Training On the
Summaries
[Plot: Precision/Recall vs Threshold; x-axis: Threshold (0 to 0.9); y-axis: 0 to 100; curves: Our Alg: Precision, Our Alg: Recall, Our Alg: F-Measure]
Figure 7: Precision, Recall, and F-Measure Using
the B-CUBED Algorithm With Training On Entire
Articles
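To make the baseline numbers concrete, here is a minimal sketch of B-CUBED-style scoring as we understand its usual definition: precision and recall are computed per item from the overlap between the item's response chain and its true chain, then averaged with equal weights. The function names and the five-item toy data are assumptions made for illustration; the F-Measure is the equal-weight form, $2PR/(P+R)$.

```python
from collections import defaultdict


def chains_by_label(assignment):
    """Group items into chains according to their chain label."""
    chains = defaultdict(set)
    for item, label in assignment.items():
        chains[label].add(item)
    return chains


def b_cubed(truth, response):
    """Per-item precision and recall, averaged with equal weights.

    truth and response map every item to a chain label.
    """
    truth_chains = chains_by_label(truth)
    response_chains = chains_by_label(response)
    precision = recall = 0.0
    for item in truth:
        true_chain = truth_chains[truth[item]]
        output_chain = response_chains[response[item]]
        correct = len(true_chain & output_chain)
        precision += correct / len(output_chain)
        recall += correct / len(true_chain)
    n = len(truth)
    return precision / n, recall / n


def f_measure(precision, recall):
    """F-Measure with equal weights for precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): two real entities, A and B.
truth = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
# Baseline response: every mention is put into a single chain.
baseline = {item: "same" for item in truth}

p, r = b_cubed(truth, baseline)
print(p, r, f_measure(p, r))
```

Merging everything keeps recall perfect but drives precision down, which mirrors the baseline reported in the text: with 23% precision and 100% recall, `f_measure(0.23, 1.0)` comes to roughly 37.4%.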
Figures 8 and 9 show the precision, recall, and
F-Measure calculated using the MUC scoring
algorithm. Under this scoring, the baseline case in which all the
John Smiths are considered to be the same person
achieves 83% precision and 100% recall. The high
initial precision is mainly due to the fact that the
MUC algorithm assumes that all errors are equal.
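The 83% figure can be checked with a short sketch of link-based MUC scoring, assuming the usual formulation in which recall is $\sum (|S| - |p(S)|) / \sum (|S| - 1)$ over the key chains $S$, with $p(S)$ the partition of $S$ induced by the response, and precision is the same quantity with the roles of key and response swapped. The function names are hypothetical, and so is the way the 173 articles are split over the 11 multi-article John Smiths (the text does not give it); the baseline scores do not depend on that split.

```python
def muc_recall(key_chains, response_chains):
    """Link-based recall: sum(|S| - |p(S)|) / sum(|S| - 1) over key chains S."""
    owner = {m: i for i, chain in enumerate(response_chains) for m in chain}
    numerator = denominator = 0
    for chain in key_chains:
        # Items missing from the response count as their own partition.
        parts = {owner.get(m, ("unlinked", m)) for m in chain}
        numerator += len(chain) - len(parts)
        denominator += len(chain) - 1
    # With no links to score, returning 0.0 is a convention of this sketch.
    return numerator / denominator if denominator else 0.0


def muc_precision(key_chains, response_chains):
    """Precision is recall with key and response swapped."""
    return muc_recall(response_chains, key_chains)


# 35 John Smiths over 197 articles: 24 singletons plus 11 chains
# covering 173 articles. The split below is hypothetical.
sizes = [1] * 24 + [2] * 10 + [153]
key, start = [], 0
for size in sizes:
    key.append(set(range(start, start + size)))
    start += size

# Baseline response: all 197 articles in one chain.
response = [set(range(197))]

precision = muc_precision(key, response)
recall = muc_recall(key, response)
print(round(precision * 100), round(recall * 100))
```

Only 34 links separate the single merged chain from the 35 true chains, so the baseline is charged (197 - 35) / (197 - 1) = 162 / 196, about 83% precision, however badly the merged chain mixes different people.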
We have also tested our system on other classes
of cross-document coreference, such as names of
companies and events. Details about these experiments
can be found in \[1\].
[Plot: Precision/Recall vs Threshold; x-axis: Threshold (0 to 0.9); y-axis: 0 to 100; curves: MUC Alg: Precision, MUC Alg: Recall, MUC Alg: F-Measure]
Figure 8: Precision, Recall, and F-Measure Using
the MUC Algorithm With Training On the
Summaries
[Plot: Precision/Recall vs Threshold; x-axis: Threshold (0 to 0.9); y-axis: 0 to 100; curves: MUC Alg: Precision, MUC Alg: Recall, MUC Alg: F-Measure]
Figure 9: Precision, Recall, and F-Measure Using
the MUC Algorithm With Training On Entire
Articles
Conclusions 
The TIPSTER phase III program has allowed us to 
explore some of the potential application areas of 
coreference annotation. We have reported on our 
strongest results, a summarization system and a 
cross-document coreference system for names. 
The query-sensitive text summarization system 
is nearly as effective as full text documents for 
determining whether a document is relevant to 
the query. The system uses a limited class of 
coreference-based relations between the query and 
the document to select sentences which represent 
instantiations of entities, events, or concepts artic- 
ulated in the query. 
As a novel research problem, cross-document
coreference provides a different perspective from
related phenomena like named entity recognition
and within-document coreference. Our system
takes summaries about an entity of interest and
uses various information retrieval metrics to rank
the similarity of the summaries. We found it quite
challenging to arrive at a scoring metric that
satisfied our intuitions about what was good system
output vs. bad, but we have developed a scoring
algorithm that is an improvement for this class of
data over other within-document coreference scoring
algorithms. Our results are quite encouraging,
with potential performance being as good as 84.6%
(F-Measure).
Future Goals 
Central to the future of this research program is
the CAMP software system. We are continually
refining and extending the software to better capture
the coreference relations that we need and to
reduce genre-dependent aspects of the system. We
are currently exploring visualization interfaces to
both within-document and cross-document coreference, which
we believe will provide strong motivation for the
importance of coreference annotation of free-text
databases. In addition, we are interested in generating
cross-document summaries based on techniques
similar to those of our within-document summarization
system.

References 

\[1\] Amit Bagga and Breck Baldwin. How much
processing is required for cross-document
coreference? In The First International Conference
on Language Resources and Evaluation
on Linguistics Coreference, Granada, Spain,
1998.

\[2\] Breck Baldwin. CogNIAC: High precision
coreference with limited knowledge and
linguistic resources. In Proceedings of the ACL
Workshop on Operational Factors in Practical,
Robust Anaphora Resolution for Unrestricted
Texts, pages 38-45, Madrid, Spain, June 1997.

\[3\] Breck Baldwin, Christine Doran, Jeffrey C. 
Reynar, Michael Niv, B. Srinivas, and Mark 
Wasson. EAGLE: An extensible architecture 
for general linguistic engineering. In Proceed- 
ings of RIAO-97, Montreal, 1997. 

\[4\] Michael Bett and Jade Goldstein. Auto- 
mated query-relevant document summariza- 
tion. In Proceedings of Tipster Text Phase III 
12-Month Workshop, 1997. 

\[5\] Michael Chrzanowski, Therese Firmin, 
Lynette Hirschman, David House, In- 
derjeet Mani, Leo Obrst, Sara Shel- 
ton, Beth Sundheim, and Sandra Wag- 
ner. (SUMMAC) call for participation. 
http://www.tipster.org/summcall.htm, Jan- 
uary 1998. 

\[6\] Michael John Collins. A New Statistical Parser
Based on Bigram Lexical Dependencies. In
Proceedings of the 34th Annual Meeting of the
ACL, 1996.

\[7\] Breck Baldwin et al. University of Pennsylvania:
Description of the University of Pennsylvania
system used for MUC-6. In Proceedings of
the Sixth Message Understanding Conference
(MUC-6), pages 177-191, 1995.

\[8\] Therese Hand. Tipster summarization evaluation
task: Dry-run evaluation results. In
Proceedings of Tipster Text Phase III 12-Month
Workshop, 1997.

\[9\] Daniel Karp, Yves Schabes, Martin Zaidel,
and Dania Egedi. A freely available wide
coverage morphological analyzer for English. In
Proceedings of the 15th International Conference
on Computational Linguistics, 1994.

\[10\] Ralph Grishman. Whither Written Language 
Evaluation? In Proceedings of the Human Lan- 
guage Technology Workshop, 1994. 

\[11\] Adwait Ratnaparkhi. A Maximum Entropy
Part of Speech Tagger. In Eric Brill and
Kenneth Church, editors, Conference on Empirical
Methods in Natural Language Processing,
University of Pennsylvania, May 17-18 1996.

\[12\] Jeffrey C. Reynar and Adwait Ratnaparkhi. A 
maximum entropy approach to identifying sen- 
tence boundaries. In Proceedings of the Fifth 
Conference on Applied Natural Language Pro- 
cessing, pages 16-19, Washington, D.C., April 
1997. 

\[13\] Gerard Salton. Automatic Text Processing: 
The Transformation, Analysis, and Retrieval 
of Information by Computer. Addison-Wesley, 
1989. 

\[14\] Tomek Strzalkowski, Fang Lin, Jin Wang, 
Langdon White, and Bowden Wise. Natural 
language information retrieval and summariza- 
tion. In Proceedings of Tipster Text Phase III 
12-Month Workshop, 1997. 

\[15\] Ellen M. Voorhees and Donna Harman. 
Overview of the fifth Text REtrieval Confer- 
ence (TREC-5). In Proceedings of the Fifth 
Text REtrieval Conference (TREC-5), pages 
1-28. NIST 500-238, 1997. 

\[16\] Nina Wacholder, Yael Ravin, and Misook 
Choi. Disambiguation of proper names in text. 
In Proceedings of the Fifth Conference on Ap- 
plied Natural Language Processing, May 1997. 
