Document Fusion for Comprehensive Event Description
Christof Monz
Institute for Logic, Language and Computation
University of Amsterdam
1018 TV Amsterdam, The Netherlands
christof@science.uva.nl
www.science.uva.nl/˜christof
Abstract
This paper describes a fully imple-
mented system for fusing related news
stories into a single comprehensive de-
scription of an event. The basic compo-
nents and the underlying algorithm are
explained. The system uses a compu-
tationally feasible and robust notion of
entailment for comparing information
stemming from different documents.
We discuss the issue of evaluating doc-
ument fusion and provide some prelim-
inary results.
1 Introduction
Conventional text retrieval systems respond to a
user’s query by providing a (ranked) list of doc-
uments which potentially satisfy the information
need. After having identified a number of docu-
ments which are actually relevant, the user reads
some of those documents to get the information
requested. To be sure to get a comprehensive ac-
count of a particular topic, the list of documents
one has to read may be rather long, including a se-
vere amount of redundancy; i.e., documents par-
tially conveying the same information.
Although this problems basically holds for any
text retrieval situation, where comprehensiveness
is relevant, it becomes particularly evident in the
retrieval of news texts.
News agencies, such as AP, BBC, CNN, or
Reuters, often describe the same event differ-
ently. For instance, they provide different back-
ground information, helping the reader to situate
the story, they interview different people to com-
ment on an event, and they provide additional,
conflicting or more accurate information, depend-
ing on their sources.
To get a description of an event which is as
comprehensive as possible and also as short as
possible, a user has to compile his or her own
description by taking parts of the original news
stories, ignoring duplicate information. Typical
users include journalists and intelligence analysts,
for whom compiling and fusing information is an
integral part of their work (Carbonell et al., 2000).
Obviously, if done manually, this process can be
rather laborious as it involves numerous compar-
isons, depending on the number and length of the
documents.
The aim of this paper is to describe an approach
automatizing this process by fusing information
stemming from different documents to generate
a single comprehensive document, containing the
information of all original documents without re-
peating information which is conveyed by two or
more documents.
The work described in this paper is closely re-
lated to the area of multi-document summariza-
tion (Barzilay et al., 1999; Mani and Bloedorn,
1999; McKeown and Radev, 1995; Radev, 2000),
where related documents are analyzed to use fre-
quently occurring segments for identifying rele-
vant information that has to be included in the
summary. Our work differs from the work on
multi-document summarization as we focus on
document fusion disregarding summarization. On
the contrary, we are not aiming for the shortest
description containing the most relevant informa-
tion, but for the shortest description containing
all information. For instance, even historic back-
ground information is included, as long as it al-
lows the reader to get a more comprehensive de-
scription of an event.
Although the techniques that are used for
multi-document fusion and multi-document sum-
marization are similar, the task of fusion is com-
plementary to the summarization task. They dif-
fer in the way that, roughly speaking, multi-
document summarization is the intersection of
information within a topic, whereas multi-
document fusion is the union of information.
They are similar to the extent that in both cases
nearly equivalent information stemming from dif-
ferent documents within the topic has to be iden-
tified as such.
The remainder of this paper is structured as fol-
lows: Section 2 introduces the main components
and challenges of implementing a document fu-
sion system. Issues of evaluating document fu-
sion and some preliminary evaluation of our sys-
tem are presented in Section 3. In Section 4,
some conclusions and prospects on future work
are given.
2 Fusing Documents
Before developing a document fusion system,
some basic issues have to be considered.
1. On which level of granularity are the docu-
ments fused (i.e., word or phrase level, sen-
tence level, or paragraph level?
2. How to decide whether news fragments from
different sources convey the same informa-
tion?
3. How to ensure readability of the fused docu-
ment? I.e., where should information stem-
ming from different documents be placed in
the fused document, retaining a natural flow
of information.
Each of the these issues is addressed in the fol-
lowing subsections.
2.1 Segmentation
In the current implementation, we decided to
fuse documents on paragraph level for two rea-
sons: First, paragraphs are less context-dependent
than sentences and are therefore easier to com-
pare. Second, compiling paragraphs yields a bet-
ter readability of the fused document. It should
be noted that paragraphs are rather short in news
stories, rarely being longer than three sentences.
When putting together (fusing) pieces of text
from different sources in a way that was not an-
ticipated by the writers of the news stories, it can
introduce information gaps. For instance, if a
paragraph containing a pronoun is taken out of its
original context and placed in a new context (the
fused document), this can lead to dangling pro-
nouns, which cannot be correctly resolved any-
more. In general, this problem does not only hold
for pronouns but for all kind of anaphoric ex-
pressions such as pronouns, definite noun phrases
(e.g., the negotiations) and anaphoric adverbials
(e.g., later). To cope with this problem simple
segmentation is applied as a pre-processing step
where paragraphs that contain pronouns or simple
definite noun phrases are attached to the preced-
ing paragraph. A more sophisticated approach to
text segmentation is described in (Hearst, 1997).
Obviously, it would be better to use an au-
tomatic anaphora resolution component to cope
with this problem, see, e.g., (Kennedy and Bogu-
raev, 1996; Kameyama, 1997), where anaphoric
expressions are replaced by their antecedents, but
at the moment, the integration of such a compo-
nent remains future work.
2.2 Informativity
(Radev, 2000) describes 24 cross-document rela-
tions that can hold between their segments, one of
which is the subsumption (or entailment) relation.
In the context of document fusion, we focus on
the entailment relation and how it can be formally
defined; unfortunately, (Radev, 2000) provides no
formal definition for any of the relations.
Computing the informativity of a segment
compared to another segment is an essential task
during document fusion. Here, we say that the i-
th segment of document d (si;d) is more informa-
tive than thej-th segment of documentd0 (sj;d0) if
si;d entails sj;d0. In theory, this should be proven
logically, but in practice this is far beyond the cur-
rent state of the art in natural language processing.
Additionally, a binary logical decision might also
be too strict for simulating the human understand-
ing of entailment.
A simple but nevertheless quite effective solu-
tion is based on one of the simpler similarity mea-
sures in information retrieval (IR), where texts are
simply represented as bags of (weighted) words.
The definition of the entailment score (es) is given
in (1). es(si;d;sj;d0) compares the sum of the
weights of terms that appear in both segments to
the total sum weights of sj;d0.
es(si;d;sj;d0) =
P
tk2si;d\sj;d0
idf k
P
tk2sj;d0
idf k (1)
The weight of a term ti is its inverse document
frequency (idf i), as defined in (2), where N is the
number of all segments in the set of related doc-
uments (the topic) and ni is the number of seg-
ments in which the term ti occurs.
idf i = log
 N
ni
 
(2)
Terms which occur in many segments (i.e., for
which ni is rather large), such as the, some, etc.,
receive a lower idf -score than terms that occur
only in a few segments. The underlying intuition
of the idf -score is that terms with a higher idf -
score are better suited for discriminating the con-
tent of a particular segment from the other seg-
ments in the topic, or to put it differently, they are
more content-bearing. Note, that the logarithm in
(2) is only used for dampening the differences.
The entailment score es(si;d;sj;d0) measures
how many of the words of the segment si;d oc-
cur in sj;d0, and how important those words are.
This is obviously a very shallow approach to en-
tailment computation, but nevertheless it proved
to be effective, see (Monz and de Rijke, 2001).
2.3 Implementation
In this subsection, we present the general algo-
rithm underlying the implementation, given a set
of documents belonging to topic T. The imple-
mentation has to tackle two basic tasks. First,
identify segments that are entailed by other seg-
ments and use the more informative one. Second,
place the remaining segments at positions with
similar content. The fusion algorithm depicted in
Figure 1 consists of five steps.
1. is basically a pre-processing step as ex-
plained above. 2. computes pairwise the cross-
document entailment scores for all segments inT.
Although the pairwise computation of es and sim
is exponential in the number of documents in T,
it still remains computationally tractable in prac-
tice. For instance, for a topic containing 4 docu-
ments (the average case) it takes 10 CPU seconds
to compute all entailment and similarity relations.
For a topic containing 8 documents (an artificially
constructed extreme case) it takes 66 CPU sec-
onds; both on a 600 MHz Pentium III PC.
In 3., one of the documents is taken as base for
the fusion process. Starting with a ‘real’ docu-
ment improves the readability of the final fused
documents as it imposes some structure on the
fusion process. There are several ways to select
the base document. For instance, take the docu-
ment with the most unique terms, or the document
with the highest document weight (sum of all idf -
scores). In the current implementation we simply
took the longest document within the topic, which
ensures a good base coverage of an event.
4. and 5. are the actual fusion steps. Step 4. re-
places a segment si;dF in the fused document by
a segment sj;d0 from another document if sj;d0 is
the segment maximally entailing si;dF and if it is
significantly (above the threshold es) more infor-
mative than si;dF . Choosing an optimal value for
 es is essential for the effectiveness of the fusion
system. Section 3 discusses some of our experi-
ments to determine  es.
Step 5. is kind of complementary to step 4.,
where related but more informative segments are
identified. Step 5. identifies segments that add
new information to dF, where a segment sj;d0 is
new if it has low similarity to all segments in dF,
i.e., if the the similarity score is below the thresh-
old  sim. If a segment sj;d0 is new, it is placed
right after the segment in dF to which it is most
similar.
Similarity is implemented as the traditional co-
sine similarity in information retrieval, as defined
in (3). This similarity measure is also known
as the tfc.tfc measure, see (Salton and Buckley,
1988).
sim(si;d;sj;d0) =
P
tk2si;d\sj;d0
wk;si;d wk;sj;d0
r P
tk2si;d
w2si;d P
tk2sj;d0
w2sj;d0
(3)
Where wk;si;d is the weight associated with the
term tk in segment si;d. In the nominator of (3),
1. segmentize all documents in T
2. for all si;d;sj;d0 s.t. d;d02T and d6= d0: compute es(si;d;sj;d0)
3. select a document d2T as fusion base document: dF
4. for all si;dF : find sj;d0 s.t. dF 6= d0 and sj;d0 = arg maxs
k;d0
: es(sk;d0;si;dF) > es(si;dF;sk;d0)
if es(sj;d0;si;dF) > es then replace si;dF by sj;d0 in the fused document
5. for all sj;d0 s.t. sj;d062dF:
if for all si;dF : sim(si;dF;sj;d0) < sim,
then find the most similar si;dF : si;dF = arg maxs
k;dF
: sim(sj;d0;sk;dF)
and place sj;d0 between si;dF and si+1;dF
Figure 1: Sketch of the document fusion algorithm.
the weights of the terms that occur insi;d andsj;d0
are summed up. The denominator is used for nor-
malization. Otherwise, longer documents tend to
result in a higher similarity score. In the current
implementation wk;si;d = idf k for all si;d. The
reader is referred to (Salton and Buckley, 1988;
Zobel and Moffat, 1998) for a broad spectrum of
similarity measures for information retrieval.
3 Evaluation Issues
The document fusion system is evaluated in two
steps. First, the effectiveness of entailment detec-
tion is evaluated, which is the key component of
our system. Then we present some preliminary
evaluation of the whole system focusing on the
quality of the fused documents.
3.1 Evaluating Entailment
Recently, we have started to build a small test col-
lection for evaluating entailment relations. The
reader is referred to (Monz and de Rijke, 2001)
for more details on the results presented in this
subsection.
For each of the 21 topics in our test corpus two
documents in the topic were randomly selected,
and given to a human assessor to determine all
subsumption relations between segments in dif-
ferent documents (within the same topic). Judg-
ments were made on a scale 0–2, according to the
extent to which one segment was found to entail
another.
Out of the 12083 possible subsumption rela-
tions between the text segments, 501 (4.15%) re-
ceived a score of 1, and 89 (0.73%) received a
score of 2.
Let a subsumption pair be an ordered pair of
segments (si;d, sj;d0) that may or may not stand
in the subsumption relation, and let a correct sub-
sumption pair be a subsumption pair (si;d, sj;d0)
for which si;d does indeed entail sj;d0. Further, a
computed subsumption pair is a subsumption pair
for which our subsumption method has produced
a score above the subsumption threshold.
Then, precision is the fraction of computed
subsumption pairs that is correct:
Precision = number of correct subsumption pairs computedtotal number of subsumption pairs computed :
And recall is the proportion of the total number of
correct subsumption pairs that were computed:
Recall = number of correct subsumption pairs computedtotal number of correct subsumption pairs :
Observe that precision and recall depend on the
subsumption threshold that we use.
We computed average recall and precision at 11
different subsumption thresholds, ranging from 0
to 1, with :1 increments; the average was com-
puted over all topics. The results are summarized
in Figures 2 (a) and (b).
Since precision and recall suggest two differ-
ent optimal subsumption thresholds, we use the F-
Score, or harmonic mean, which has a high value
only when both recall and precision are high.
F = 21
Recall +
1
Precision
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
 Subsumption Threshold
 Precision
Human Judgm. > 0Human Judgm. > 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
 Subsumption Threshold
 Recall
Human Judgm. > 0Human Judgm. > 1
(a) (b)
Figure 2: (a) Average precision with human judgments > 0 and > 1. (b) Average recall with human
judgments > 0 and > 1.
The average F-scores are given in Figure 3.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
 Subsumption Threshold
 F−Score
Human Judgm. > 0Human Judgm. > 1
Figure 3: Average F-scores with human judg-
ments > 0 and > 1.
The optimal subsumption threshold for human
judgments > 0 is around 0.18, while it is approx-
imately 0:4 for human judgments > 1. This con-
firms the intuition that a higher threshold is more
effective when human judgments are stricter.
3.2 Evaluating Fusion
In the introduction, it was pointed out that docu-
ment fusion by hand can be rather laborious, and
the same holds for the evaluation of automatic
document fusion. Similar to automatic summa-
rization, there are no standard document collec-
tions or clear evaluation criteria aiding to autom-
atize the process of evaluation. One approach
could be to focus on news stories which mention
their sources. For instance CNN’s new stories of-
ten say that “AP and Reuters contributed to this
story”. On the other hand one has to be cautious
to take those news stories as gold standard as the
respective contributions of the journalist and his
or her sources are not made explicit.
In the area of multi-document summarization,
there is a distinction between intrinsic and extrin-
sic evaluation, see (Mani et al., 1998). Intrin-
sic evaluation judges the quality directly based on
analysis of the summary. Usually, a human judge
assesses the quality of a summary based on some
standardized evaluation criteria.
In extrinsic evaluation, the usefulness of a sum-
mary is judged based on how it affects the com-
pletion of some other task. A typical task used
for extrinsic evaluation is ad-hoc retrieval, where
the relevance of a retrieved document is assessed
by a human judge based on the document’s sum-
mary. Then, those judgments are compared to
judgments based on original documents, see, e.g.,
(Brandow et al., 1995; Mani and Bloedorn, 1999).
At this stage we have just carried out some pre-
liminary evaluation. The test collection consists
of 69 news stories categorized into 21 topics. Cat-
egorization was done by hand, but it is also pos-
sible to have information filtering, see (Robertson
and Hull, 2001), or topic detection and tracking
(TDT) tools carrying out this task (Allan et al.,
1998). All documents belonging to the same topic
were released on the same day and describe the
same event. Table 1 provides further details on
the collection.
avg. per topic
no. of docs. 3.3 docs.
length of a doc. 612 words
length of all docs. together 2115 words
length of longest doc. 783 words
length of shortest doc. 444 words
Table 1: Test collection (21 topics, 69 docu-
ments).
In addition to the aforementioned news agen-
cies, the collection includes texts from the L.A.
Times, Washington Post and Washington Times.
In general, a segment should be included in the
fused document if it did not occur before to avoid
redundancy (False Alarm), and if it adds informa-
tion, so no information is left out (Miss). As in IR
or TDT, Miss and False Alarm tend to be inversely
related; i.e., a decrease of Miss often results in an
increase of False Alarm and vice versa.
Table 2 illustrates the different possibilities
how the system responds as to whether a seg-
ment should be included in the fused document
and how a human reader judges.
system reader: reader:
judgement include exclude
include a b
exclude c d
Table 2: Contingency table.
Then, Miss and False Alarm can be defined as
in (4) and (5), respectively.
Miss = ca+c if a+c> 0 (4)
False Alarm = bb+d if b+d> 0 (5)
The fusion impact factor (fif) describes to what
extent the different sources actually contributed to
the fused document. For instance if the fused doc-
ument solely contains segments from one source,
fif equals 0, and if all sources equally contributed
it equals 1. This can be formalized as follows:
 f = 1 
P
d2T
j1=nT nseg;d=nseg j
2 (1 1=nT) (6)
Where S is a set of related documents, and nT
is its size. nseg is the number of segments in the
fused document and nseg;d is the number of seg-
ments stemming from document d.
For our test collection, the average fusion im-
pact factor was 0.56. Of course the fif -score de-
pends on the choice of  es and  sim, in a way that
a lower value of  es or a higher value of  sim in-
creases the fif -score. In this case,  es = 0:2 and
 sim = 0:05.
Table 3 shows the length of the fused docu-
ments in average compared to the longest, short-
est, and all documents in a topic, for  es = 0:2
and  sim = 0:05.
avg. compression
ratio per topic
all docs. together 0.55
longest doc. 1.36
shortest doc. 2.55
Table 3: Compression ratios.
Measuring Miss intrinsically is extremely labo-
rious; especially comparing the effectiveness of
different values for the thresholds  es and  sim is
infeasible in practice. Therefore, we decided to
measure Miss extrinsically. We used ad-hoc re-
trieval as the extrinsic evaluation task. The eval-
uation criterion is stated as follows: Using the
fused document of each topic as a query, what is
the average (non-interpolated) precision?
As baseline, we concatenated all documents
of each topic. This would constitute an event
description that does not miss any information
within the topic. This document is then used to
query a collection of 242,996 documents, con-
taining the 69 documents from our test collection.
Since the baseline is simply the concatenation of
all documents within the topic, one can expect
that all documents from that topic receive a high
rank in the set of retrieved documents. This aver-
age precision forms the optimal performance for
that topic. For instance, if a topic contains three
documents, and the ad-hoc retrieval ranks those
documents as 1, 3, and 6, there are three recall
levels: 33: 3%, 66: 6%, and 100%. The precision
at these levels is 1/1, 2/3, and 3/6 respectively,
which averages to 0:7 2.
The next step is to compare the actually fused
documents to the baseline. It is to be expected that
the performance is worse, because the fused docu-
ments do not contain segments which are entailed
by other segments in the topic. For instance, if
the fused document for the aforementioned topic
is used as a query and the original documents of
the topic are ranked as 2, 4, and 9, the average
precision is (1/2 + 2/4 + 3/9)/3 = 0: 4.
Compared to 0:7 2 for the baseline, fusion leads
to a decrease of effectiveness of approximately
38.5%. Figure 4, gives the averaged precision for
the different values for  es.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.6
0.7
0.8
0.9
1
 Entailment Threshold 
 Avgerage Precision
SystemBaseline
Figure 4: Average precision for ad-hoc retrieval.
It is not obvious how to interpret the numerical
value of the ad-hoc retrieval precision in terms of
Miss, but the degree of deviation from the base-
line gives a rough estimate of the completeness
of a fused document. At least, this allows for an
ordinally scaled ranking of the different methods
(in our case different values for  es), that are used
for generating the fused documents. Figure 4 il-
lustrates that in the context of the ad-hoc retrieval
evaluation an optimal entailment threshold ( es)
lies around 0.2. Table 4 shows the decrease in re-
trieval effectiveness in percent, compared to the
baseline. The average precision at 0.2 is 0.8614,
which is just 11:5% below the baseline.
For all ad-hoc retrieval experiments, the Lnu.ltu
weighting scheme, see (Singhal et al., 1996), has
been used, which is one of the best-performing
weighting schemes in ad-hoc retrieval. In addi-
tion to the 69 documents from our collection, the
retrieval collection contains articles from Associ-
Decrease in Decrease in
 es precision  es precision
0.0 20.9% 0.6 23.8%
0.1 14.6% 0.7 23.8%
0.2 11.5% 0.8 23.8%
0.3 13.7% 0.9 23.8%
0.4 22.0% 1.0 24.4%
0.5 22.2%
Table 4: Differences to baseline retrieval.
ated Press 1988–1990 (from the TREC distribu-
tion), which also belong to the newswire or news-
paper domain. Any meta information such as the
name of the journalist or news agency is removed
to avoid matches based on that information.
In the context of multi-document summariza-
tion, (Stein et al., 2000) use topic clustering for
extrinsic evaluation. Although we did not carry
out any evaluation based on topic clustering, it
seems that it could also be applied to multi-
document fusion, given the close relationship be-
tween fusion and summarization on the one hand
and retrieval and clustering on the other hand.
4 Conclusions
The document fusion system described is just pro-
totype and there is much more space for improve-
ment. Although detecting redundancies by using
a shallow notion of entailment works reasonably
well, it is still far from perfect.
In the current implementation, text analysis is
very shallow. Pattern matching is used to avoid
dangling anaphora and lemmatization is used to
make the entailment and similarity scores unsus-
ceptible to morphological variations such as num-
ber and tense. A question for future research is to
what extent shallow parsing techniques can im-
prove the entailment scores. In particular, does
considering the relational structure of a sentence
improve computing entailment relations? This
has shown to be successful in inference-based ap-
proaches to question-answering, see (Harabagiu
et al., 2000), and document fusion might also ben-
efit from representations that are a bit deeper than
the one discussed in this paper.
Another open issue at this point is the need for
standards for evaluating the quality of document
fusion. We think that this can be done by using
standard IR measures like Miss and False Alarm.
Although Miss can be approximated extrinsically,
it is unclear whether this also possible for False
Alarm. Obviously, intrinsic evaluation is more re-
liable, but it remains an extremely laborious pro-
cess, where inter-judge disagreement is still an is-
sue, see (Radev et al., 2000).
Acknowledgments
The author would like to thank Maarten de Rijke
for providing the entailment judgments. This
work was supported by the Physical Sci-
ences Council with financial support from the
Netherlands Organization for Scientific Research
(NWO), project 612-13-001.

References
J. Allan, J. Carbonell, G. Doddington, J. Yamron, and
Y. Yang. 1998. Topic detection and tracking pilot
study final report. In Proceedings of the Broadcast
News Transcription and Understranding Workshop
(Sponsored by DARPA).
R. Barzilay, K. McKeown, and M. Elhadad. 1999. In-
formation fusion in the context of multi-document
summarization. In Proceedings of the 37th Annual
Meeting of the Association of Computational Lin-
guistics (ACL’99).
R. Brandow, K. Mitze, and L. Rau. 1995. Automatic
condensation of electronic publications by sentence
selection. Information Processing & Management,
31(5):675–685.
J. Carbonell, D. Harman, E. Hovy, S. Maiorano,
J. Prange, and Sparck-Jones. K. 2000. Vision
statement to guide research in question answering
(Q&A) and text summarization. NIST Draft Publi-
cation.
S. Harabagiu, M. Pasca, and S. Maiorano. 2000. Ex-
periments with open-domain textual question an-
swering. In Proceedings of COLING-2000.
M. Hearst. 1997. TextTiling: Segmenting text into
multi-paragraph subtopic passages. Computational
Linguistics, 23(1):33–64.
M. Kameyama. 1997. Recognizing referential
links: An information extraction perspective. In
R. Mitkov and B. Boguraev, editors, Proceedings
of ACL/EACL-97 Workshop on Operational Factors
in Practical, Robust Anaphora Resolution for Unre-
stricted Texts, pages 46–53.
C. Kennedy and B. Boguraev. 1996. Anaphora for
everyone: Pronominal anaphora resolution without
a parser. In Proceedings of the 16th International
Conference on Computational Linguistics (COL-
ING’96). Association for Computational Linguis-
tics.
I. Mani and E. Bloedorn. 1999. Summarizing sim-
ilarities and differences among related documents.
Information Retrieval, 1(1–2):35–67.
I. Mani, D. House, G. Klein, L. Hirschman, L. Obrst,
T. Firmin, M. Chrzanowski, and B. Sundheim.
1998. The Tipster SUMMAC text summariza-
tion evaluation, final report. Technical Report
98W0000138, Mitre.
K. McKeown and D. Radev. 1995. Generating sum-
maries of multiple news articles. In Proceedings of
the 18th Annual International ACM SIGIR Confer-
ence on Research and Development in Information
Retrieval, pages 74–82.
C. Monz and M. de Rijke. 2001. Light-weight sub-
sumption checking for computational semantics. In
P. Blackburn and M. Kohlhase, editors, Proceedings
of the 3rd Workshop on Inference in Computational
Semantics (ICoS-3).
D. Radev, H. Jing, and M. Budzikowska. 2000.
Centroid-based summarization of multiple docu-
ments: Clustering, sentence extraction, and evalu-
ation. In Proceedings of the ANLP/NAACL-2000
Workshop on Summarization.
D. Radev. 2000. A common theory of information
fusion from multiple text sources, step one: Cross-
document structure. In In Proceedings of the 1st
ACL SIGDIAL Workshop on Discourse and Dia-
logue.
S. Robertson and D. Hull. 2001. The TREC-9 fil-
tering track final report. In Proceedings of The 9th
Text Retrieval Conference (TREC-9). NIST Special
Publication.
G. Salton and C. Buckley. 1988. Term-weighting ap-
proaches in automatic text retrieval. Information
Processing & Management, 24(5):513–523.
A. Singhal, G. Salton, M. Mitra, and C. Buckley.
1996. Document length normalization. Informa-
tion Processing & Management, 32(5):619–633.
G. Stein, G. Wise, T. Strzalkowski, and A. Bagga.
2000. Evaluating summaries for multiple docu-
ments in an interactive environment. In Proceed-
ings of the Second International Conference on
Language Resources and Evaluation (LREC’00),
pages 1651–1657.
J. Zobel and A. Moffat. 1998. Exploring the similarity
space. ACM SIGIR Forum, 32(1):18–34.
