Query-Relevant Summarization using FAQs
Adam Berger, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, aberger@cs.cmu.edu
Vibhu O. Mittal, Just Research, 4616 Henry Street, Pittsburgh, PA 15213, mittal@justresearch.com
Abstract
This paper introduces a statistical model for
query-relevant summarization: succinctly
characterizing the relevance of a document
to a query. Learning parameter values for
the proposed model requires a large collec-
tion of summarized documents, which we
do not have, but as a proxy, we use a col-
lection of FAQ (frequently-asked question)
documents. Taking a learning approach en-
ables a principled, quantitative evaluation
of the proposed system, and the results of
some initial experiments—on a collection
of Usenet FAQs and on a FAQ-like set
of customer-submitted questions to several
large retail companies—suggest the plausi-
bility of learning for summarization.
1 Introduction
An important distinction in document summarization
is between generic summaries, which capture the cen-
tral ideas of the document in much the same way that
the abstract of this paper was designed to distill its
salient points, and query-relevant summaries, which
reflect the relevance of a document to a user-specified
query. This paper discusses query-relevant summa-
rization, sometimes also called “user-focused summa-
rization” (Mani and Bloedorn, 1998).
Query-relevant summaries are especially important
in the “needle(s) in a haystack” document retrieval
problem: a user has an information need expressed
as a query (What countries export smoked
salmon?), and a retrieval system must locate within
a large collection of documents those documents most
likely to fulfill this need. Many interactive retrieval
systems—web search engines like Altavista, for
instance—present the user with a small set of candi-
date relevant documents, each summarized; the user
must then perform a kind of triage to identify likely
relevant documents from this set. The web page sum-
maries presented by most search engines are generic,
not query-relevant, and thus provide very little guid-
ance to the user in assessing relevance. Query-relevant
summarization (QRS) aims to provide a more effective
characterization of a document by accounting for the
user’s information need when generating a summary.
[Figure 1: two-panel diagram — (a) search for relevant documents given a query Q; (b) summarize each retrieved document relative to Q.]
Figure 1: One promising setting for query-relevant summarization is large-scale document retrieval. Given a user query q, search engines typically first (a) identify a set of documents which appear potentially relevant to the query, and then (b) produce a short characterization σ(d, q) of each document's relevance to q. The purpose of σ(d, q) is to assist the user in finding documents that merit a more detailed inspection.
As with almost all previous work on summarization,
this paper focuses on the task of extractive summariza-
tion: selecting as summaries text spans—either com-
plete sentences or paragraphs—from the original doc-
ument.
1.1 Statistical models for summarization
From a document d and query q, the task of query-relevant summarization is to extract a portion s of d which best reveals how the document relates to the query. To begin, we start with a collection C of (d, q, s) triplets, where s is a human-constructed summary of d relative to the query q.
[Figure 2: three example triplets — D1 "Snow is not unusual in France..." with Q1 = "Weather in Paris in December" and summary S1; D2 "Some parents elect to teach their children at home..." with Q2 = "Homeschooling" and summary S2; D3 "Good Will Hunting is about..." with Q3 = "Academy award winners in 1998" and summary S3.]
Figure 2: Learning to perform query-relevant summarization requires a set of documents summarized with respect to queries. Here we show three imaginary (d, q, s) triplets, but the statistical learning techniques described in Section 2 require thousands of examples.
From such a collection of data, we fit the best function f : (q, d) → s mapping document/query pairs to summaries.
The mapping we use is a probabilistic one, meaning
the system assigns a value p(s | d, q) to every possible summary s of (d, q). The QRS system will summarize a (d, q) pair by selecting

    f(d, q) \;\stackrel{\mathrm{def}}{=}\; \arg\max_s \; p(s \mid d, q)
There are at least two ways to interpret p(s | d, q). First, one could view p(s | d, q) as a "degree of belief" that the correct summary of d relative to q is s.
Of course, what constitutes a good summary in any
setting is subjective: any two people performing the
same summarization task will likely disagree on which
part of the document to extract. We could, in principle,
ask a large number of people to perform the same task.
Doing so would impose a distribution p(· | d, q) over candidate summaries. Under the second, or "frequentist" interpretation, p(s | d, q) is the fraction of people who would select s—equivalently, the probability that a person selected at random would prefer s as the summary.
The statistical model p(· | d, q) is parametric; its parameter values are learned by inspection of the (d, q, s) triplets. The learning process involves maximum-likelihood estimation of probabilistic language models and the statistical technique of shrinkage (Stein, 1955).
This probabilistic approach easily generalizes to the generic summarization setting, where there is no query. In that case, the training data consists of (d, s) pairs, where s is a summary of the document d. The goal, in this case, is to learn and apply a mapping g : d → s from documents to summaries.
[Figure 3: a single FAQ document containing question/answer pairs — "What is amniocentesis?" / "Amniocentesis, or amnio, is a prenatal test in which...", "What can it detect?" / "One of the main uses of amniocentesis is to detect chromosomal abnormalities...", "What are the risks of amnio?" / "The main risk of amnio is that it may increase the chance of miscarriage..." — where the answer to the second question serves as the summary of the document with respect to Q2.]
Figure 3: FAQs consist of a list of questions and answers on a single topic; the FAQ depicted here is part of an informational document on amniocentesis. This paper views answers in a FAQ as different summaries of the FAQ: the answer to the i-th question is a summary of the FAQ relative to that question.
That is, we seek

    g(d) \;\stackrel{\mathrm{def}}{=}\; \arg\max_s \; p(s \mid d)
1.2 Using FAQ data for summarization
We have proposed using statistical learning to con-
struct a summarization system, but have not yet dis-
cussed the one crucial ingredient of any learning pro-
cedure: training data. The ideal training data would
contain a large number of heterogeneous documents, a
large number of queries, and summaries of each doc-
ument relative to each query. We know of no such
publicly-available collection. Many studies on text
summarization have focused on the task of summariz-
ing newswire text, but there is no obvious way to use
news articles for query-relevant summarization within
our proposed framework.
In this paper, we propose a novel data collection
for training a QRS model: frequently-asked question
documents. Each frequently-asked question document
(FAQ) is comprised of questions and answers about
a specific topic. We view each answer in a FAQ as
a summary of the document relative to the question
which preceded it. That is, a FAQ with N question/answer pairs comes equipped with N different queries and summaries: the answer to the i-th question is a summary of the document relative to the i-th question. While a somewhat unorthodox perspective,
this insight allows us to enlist FAQs as labeled train-
ing data for the purpose of learning the parameters of
a statistical QRS model.
FAQ data has some properties that make it particu-
larly attractive for text learning:
• There exist a large number of Usenet FAQs—several thousand documents—publicly available on the Web.¹ Moreover, many large companies maintain their own FAQs to streamline the customer-response process.
• FAQs are generally well-structured documents, so the task of extracting the constituent parts (queries and answers) is amenable to automation (see the sketch after this list). There have even been proposals for standardized FAQ formats, such as RFC 1153 and the Minimal Digest Format (Wancho, 1990).
• Usenet FAQs cover an astonishingly wide variety of topics, ranging from extraterrestrial visitors to mutual-fund investing. If there's an online community of people with a common interest, there's likely to be a Usenet FAQ on that subject.

¹Two online sources for FAQ data are www.faqs.org and rtfm.mit.edu.
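As a rough illustration of the second point above, the sketch below splits a plain-text FAQ into question/answer pairs by treating any line ending in "?" as a question header. The heuristic and the toy input are our own assumptions for illustration, not a description of the preprocessing actually used in this paper.

```python
import re

def parse_faq(text):
    """Split a plain-text FAQ into (question, answer) pairs.

    Heuristic: a line ending in '?' starts a new question; the lines that
    follow, up to the next such line, form its answer.
    """
    pairs, question, answer_lines = [], None, []
    for line in text.splitlines():
        if re.search(r"\?\s*$", line):                  # a new question header
            if question is not None:
                pairs.append((question, " ".join(answer_lines).strip()))
            question, answer_lines = line.strip(), []
        elif question is not None:
            answer_lines.append(line.strip())
    if question is not None:
        pairs.append((question, " ".join(answer_lines).strip()))
    return pairs

faq_text = """What is amniocentesis?
Amniocentesis, or amnio, is a prenatal test in which...
What can it detect?
One of the main uses of amniocentesis is to detect chromosomal abnormalities...
"""
for q, a in parse_faq(faq_text):
    print(q, "->", a)
```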
There has been a small amount of published work
involving question/answer data, including (Sato and
Sato, 1998) and (Lin, 1999). Sato and Sato used FAQs
as a source of summarization corpora, although in
quite a different context than that presented here. Lin
used the datasets from a question/answer task within
the Tipster project, a dataset of considerably smaller
size than the FAQs we employ. Neither of these papers
focused on a statistical machine learning approach to
summarization.
2 A probabilistic model of
summarization
Given a query q and document d, the query-relevant summarization task is to find

    s^* \;\stackrel{\mathrm{def}}{=}\; \arg\max_s \; p(s \mid d, q),

the a posteriori most probable summary for (d, q).
Using Bayes' rule, we can rewrite this expression as

    s^* \;=\; \arg\max_s \; p(q \mid s, d)\, p(s \mid d)
        \;\approx\; \arg\max_s \; \underbrace{p(q \mid s)}_{\text{relevance}} \; \underbrace{p(s \mid d)}_{\text{fidelity}},          (1)

where the last line follows by dropping the dependence on d in p(q | s, d).
Equation (1) is a search problem: find the summary s* which maximizes the product of two factors:
1. The relevance p(q | s) of the query to the summary: A document may contain some portions directly relevant to the query, and other sections bearing little or no relation to the query. Consider, for instance, the problem of summarizing a survey on the history of organized sports relative to the query "Who was Lou Gehrig?" A summary mentioning Lou Gehrig is probably more relevant to this query than one describing the rules of volleyball, even if two-thirds of the survey happens to be about volleyball.
2. The fidelity p(s | d) of the summary to the document: Among a set of candidate summaries whose relevance scores are comparable, we should prefer the summary s which is most representative of the document as a whole. Summaries of documents relative to a query can often mislead a reader into overestimating the relevance of an unrelated document. In particular,
very long documents are likely (by sheer luck)
to contain some portion which appears related to
the query. A document having nothing to do with
Lou Gehrig may include a mention of his name
in passing, perhaps in the context of amyotrophic
lateral sclerosis, the disease from which he suf-
fered. The fidelity term guards against this occur-
rence by rewarding or penalizing candidate sum-
maries, depending on whether they are germane
to the main theme of the document.
More generally, the fidelity term represents a
prior, query-independent distribution over candi-
date summaries. In addition to enforcing fidelity,
this term could serve to distinguish between more
and less fluent candidate summaries, in much the
same way that traditional language models steer a
speech dictation system towards more fluent hy-
pothesized transcriptions.
In words, (1) says that the best summary of a doc-
ument relative to a query is relevant to the query (ex-
hibits a large p(q | s) value) and also representative of the document from which it was extracted (exhibits a large p(s | d) value). We now describe the paramet-
ric form of these models, and how one can determine
optimal values for these parameters using maximum-
likelihood estimation.
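To make the decision rule concrete before turning to the parametric models, here is a minimal sketch of how a QRS system would rank candidate extracts by the product in (1). The two scoring functions are crude word-overlap stand-ins of our own, not the language models estimated below.

```python
import math

def best_summary(candidates, query, doc, log_rel, log_fid):
    """Return the candidate s maximizing log p(q|s) + log p(s|d), as in (1)."""
    return max(candidates, key=lambda s: log_rel(query, s) + log_fid(s, doc))

# Toy stand-ins for the two factors (the real models are defined in Section 2.1).
def log_rel(q, s):
    return sum(math.log(1 + s.split().count(w)) for w in q.split())

def log_fid(s, d):
    return sum(math.log(1 + d.split().count(w)) for w in s.split())

doc = "snow is not unusual in france in december snow falls often"
candidates = ["snow is not unusual in france", "december is a month"]
print(best_summary(candidates, "weather in france", doc, log_rel, log_fid))
```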
2.1 Language modeling
The type of statistical model we employ for both
p(q | s) and p(s | d) is a unigram probability distribution over words; in other words, a language model. Stochastic models of language have been used exten-
sively in speech recognition, optical character recogni-
tion, and machine translation (Jelinek, 1997; Berger et
al., 1994). Language models have also started to find
their way into document retrieval (Ponte and Croft,
1998; Ponte, 1998).
The fidelity model p(s | d)

One simple statistical characterization of an n-word document d = {w_1, w_2, ..., w_n} is the frequency of each word in d—in other words, a marginal distribution over words. That is, if word w appears k times in d, then p_d(w) = k/n. This is not only intuitive, but also the maximum-likelihood estimate for p_d(w).
Now imagine that, when asked to summarize d relative to q, a person generates a summary from d in the following way:

• Select a length m for the summary according to some distribution l_d.
• Do for i = 1, 2, ..., m:
  – Select a word w at random according to the distribution p_d. (That is, throw all the words in d into a bag, pull one out, and then replace it.)
  – Set s_i ← w.

In following this procedure, the person will generate the summary s = {s_1, s_2, ..., s_m} with probability

    p(s \mid d) \;=\; l_d(m) \prod_{i=1}^{m} p_d(s_i)          (2)
Denoting by W the set of all known words, and by c(w ∈ s) the number of times that word w appears in s, one can also write (2) as a multinomial distribution:

    p(s \mid d) \;=\; l_d(m) \prod_{w \in W} p_d(w)^{c(w \in s)}          (3)
In the text classification literature, this characterization of d is known as a "bag of words" model, since the distribution p_d does not take account of the order of the words within the document d, but rather views d as an unordered set ("bag") of words. Of course, ignoring word order amounts to discarding potentially valuable information. In Figure 3, for instance, the second question contains an anaphoric reference to the preceding question: a sophisticated context-sensitive model of language might be able to detect that "it" in this context refers to amniocentesis, but a context-free model will not.
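A minimal sketch of the fidelity score in (3), under our own simplifying assumptions (whitespace tokenization, the length term l_d dropped, and computation in log space to avoid underflow):

```python
import math
from collections import Counter

def log_fidelity(summary, document):
    """log p(s|d) under the unigram model of (3), ignoring the length term l_d.

    p_d(w) is the maximum-likelihood estimate c(w, d) / n, so any extracted
    summary (whose words all occur in d) receives a finite score.
    """
    doc_counts = Counter(document.split())
    n = sum(doc_counts.values())
    total = 0.0
    for w, c in Counter(summary.split()).items():
        if doc_counts[w] == 0:        # cannot happen for an extracted summary
            return float("-inf")
        total += c * math.log(doc_counts[w] / n)
    return total

d = "snow is not unusual in france snow falls in december"
print(log_fidelity("snow falls in december", d))
```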
The relevance model p(q | s)
In principle, one could proceed analogously to (2), and take

    p(q \mid s) \;=\; l_s(k) \prod_{i=1}^{k} p_s(q_i)          (4)

for a length-k query q = {q_1, q_2, ..., q_k}. But this strategy suffers from a sparse estimation problem. In contrast to a document, which we expect will typically contain a few hundred words, a normal-sized summary contains just a handful of words. What this means is that p_s will assign zero probability to most words, and
Figure 4: The relevance p(q | s_ij) of a query to the j-th answer in document d_i is a convex combination of five distributions: (1) a uniform model p_U; (2) a corpus-wide model p_C; (3) a model p_{D_i} constructed from the document containing s_ij; (4) a model p_{N_ij} constructed from s_ij and the neighboring sentences in d_i; (5) a model p_{S_ij} constructed from s_ij alone. (The p_N distribution is omitted for clarity.)
any query containing a word not in the summary will
receive a relevance score of zero.
(The fidelity model doesn’t suffer from zero-
probabilities, at least not in the extractive summariza-
tion setting. Since a summary s is part of its containing document d, every word in s also appears in d, and therefore p_d(w) > 0 for every word w ∈ s. But
we have no guarantee, for the relevance model, that a
summary contains all the words in the query.)
We address this zero-probability problem by interpolating or "smoothing" the p_s model with four more robustly estimated unigram word models. Listed in order of decreasing variance but increasing bias away from p_s, they are:
• p_N: a probability distribution constructed using not only s, but also all words within the six summaries (answers) surrounding s in d. Since p_N is calculated using more text than just s alone, its parameter estimates should be more robust than those of p_s. On the other hand, the p_N model is, by construction, biased away from p_s, and therefore provides only indirect evidence for the relation between q and s.

• p_D: a probability distribution constructed over the entire document d containing s. This model has even less variance than p_N, but is even more biased away from p_s.

• p_C: a probability distribution constructed over all documents d.

• p_U: the uniform distribution over all words.
Figure 4 is a hierarchical depiction of the various language models which come into play in calculating p(q | s). Each summary model p_s lives at a leaf node, and the relevance p(q | s) of a query to that summary is a convex combination of the distributions at each node
Algorithm: Shrinkage for λ estimation
Input: Distributions p_S, p_N, p_D, p_C, p_U; held-out data H = {(d, q, s)} (not used to estimate p_S, p_N, p_D, p_C, p_U)
Output: Model weights λ = {λ_S, λ_N, λ_D, λ_C, λ_U}

1. Set λ_S ← λ_N ← λ_D ← λ_C ← λ_U ← 1/5
2. Repeat until λ converges:
3.   Set count_α ← 0 for α ∈ {S, N, D, C, U}
4.   (E-step) For each held-out query q in H:
       count_S ← count_S + λ_S p_S(q) / Σ_α λ_α p_α(q)   (similarly for N, D, C, U)
5.   (M-step) λ_S ← count_S / Σ_α count_α   (similarly for N, D, C, U)
along a path from the leaf to the root²:

    p(q \mid s) \;=\; \lambda_S\, p_S(q) + \lambda_N\, p_N(q) + \lambda_D\, p_D(q) + \lambda_C\, p_C(q) + \lambda_U\, p_U(q)          (5)

We calculate the weighting coefficients λ = {λ_S, λ_N, λ_D, λ_C, λ_U} using the statistical technique known as shrinkage (Stein, 1955), a simple form of the EM algorithm (Dempster et al., 1977).

²By incorporating a p_D model into the relevance model, equation (5) has implicitly resurrected the dependence on d which we dropped, for the sake of simplicity, in deriving (1).
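The sketch below is our own minimal rendering of this shrinkage procedure, assuming each component model is supplied as a dictionary mapping words to probabilities and that the held-out queries are lists of tokens; it illustrates the E- and M-steps of the algorithm box above rather than reproducing the actual training code.

```python
def mixture_prob(word, models, lam):
    """p(word) under the convex combination of equation (5)."""
    return sum(lam[name] * models[name].get(word, 0.0) for name in models)

def shrinkage(models, heldout_queries, iterations=10):
    """Estimate the mixture weights lambda by EM ('shrinkage')."""
    names = list(models)
    lam = {name: 1.0 / len(names) for name in names}        # uniform initialization
    for _ in range(iterations):
        counts = {name: 0.0 for name in names}
        for query in heldout_queries:                        # E-step
            for w in query:
                z = mixture_prob(w, models, lam)
                if z == 0.0:
                    continue
                for name in names:
                    counts[name] += lam[name] * models[name].get(w, 0.0) / z
        total = sum(counts.values())                          # M-step
        lam = {name: counts[name] / total for name in names}
    return lam

# Toy usage: a sparse 'summary' model smoothed against a uniform model.
models = {"summary": {"snow": 0.5, "france": 0.5},
          "uniform": {"snow": 1 / 3, "france": 1 / 3, "december": 1 / 3}}
print(shrinkage(models, heldout_queries=[["snow", "december"]]))
```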
As a practical matter, if one assumes the l_s model assigns probabilities independently of s, then we can drop the l_s term when ranking candidate summaries, since the score of all candidate summaries will receive an identical contribution from the l_s term. We
make this simplifying assumption in the experiments
reported in the following section.
3 Results
To gauge how well our proposed summarization tech-
nique performs, we applied it to two different real-
world collections of answered questions:
Usenet FAQs: A collection of 201 frequently-asked question documents from the comp.* Usenet hierarchy. The documents contained 1800 question/answer pairs in total.
Call-center data: A collection of questions
submitted by customers to the companies Air
Canada, Ben and Jerry, Iomagic, and Mylex,
along with the answers supplied by company
representatives. These four documents contain 10,395 question/answer pairs.
We conducted an identical, parallel set of experi-
ments on both. First, we used a randomly-selected
subset of 70% of the question/answer pairs to calcu-
late the language models p_S, p_N, p_D, and p_C—a simple matter of counting word frequencies. Then, we used this same set of data to estimate the model weights λ = {λ_S, λ_N, λ_D, λ_C, λ_U} using shrinkage. We reserved the remaining 30% of the question/answer pairs to evaluate the performance of the system, in a manner described below.
Figure 5 shows the progress of the EM algo-
rithm in calculating maximum-likelihood values for
the smoothing coefficients λ, for the first of the three runs on the Usenet data. The quick convergence and the final λ values were essentially identical for the
other partitions of this dataset.
The call-center data’s convergence behavior was
similar, although the final λ values were quite differ-
ent. Figure 6 shows the final model weights for the
first of the three experiments on both datasets. For
the Usenet FAQ data, the corpus language model is the best predictor of the query and thus receives the highest weight. This may seem counterintuitive; one might suspect that the answer to the query (s, that is) would be
most similar to, and therefore the best predictor of,
the query. But the corpus model, while certainly bi-
ased away from the distribution of words found in the
query, contains (by construction) no zeros, whereas
each summary model is typically very sparse.
In the call-center data, the corpus model weight
is lower at the expense of a higher document model
weight. We suspect this arises from the fact that the
documents in the Usenet data were all quite similar to
one another in lexical content, in contrast to the call-
center documents. As a result, in the call-center data
the document containing s will appear much more rel-
evant than the corpus as a whole.
To evaluate the performance of the trained QRS
model, we used the previously-unseen portion of the
FAQ data in the following way. For each test (d, q) pair, we recorded how highly the system ranked the correct summary s*—the answer to q in d—relative to the other answers in d. We repeated this entire se-
quence three times for both the Usenet and the call-
center data.
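In code, the per-query measurement might look like the sketch below (our own illustration; the scoring function stands in for the trained model of Section 2, and the toy scorer is just word overlap):

```python
def rank_of_correct_answer(query, answers, correct_index, score):
    """Rank (1 = best) assigned to the true answer, relative to the other
    answers in the same document, by a given scoring function."""
    scores = [score(query, a) for a in answers]
    correct = scores[correct_index]
    return 1 + sum(1 for s in scores if s > correct)

toy_score = lambda q, a: len(set(q.split()) & set(a.split()))
answers = ["amnio may increase the chance of miscarriage",
           "a prenatal test in which a needle is used"]
print(rank_of_correct_answer("risks of amnio", answers, correct_index=0,
                             score=toy_score))   # -> 1
```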
For these datasets, we discovered that using a uni-
form fidelity term in place of the p(s | d) model de-
scribed above yields essentially the same result. This
is not surprising: while the fidelity term is an important
component of a real summarization system, our evalu-
ation was conducted in an answer-locating framework,
and in this context the fidelity term—enforcing that the
summary be similar to the entire document from which
[Figure 5 plots: left panel, model weight vs. iteration for the uniform, corpus, FAQ, nearby-answers, and answer models; right panel, log-likelihood of the training and test data vs. iteration.]
Figure 5: Estimating the weights of the five constituent models in (5) using the EM algorithm. The values here were computed using a single, randomly-selected 70% portion of the Usenet FAQ dataset. Left: The weights λ for the models are initialized to 1/5, but within a few iterations settle to their final values. Right: The progression of the likelihood of the training data during the execution of the EM algorithm; almost all of the improvement comes in the first five iterations.
[Figure 6 data — Usenet FAQ: Summary 29%, Neighbors 10%, Document 14%, Corpus 47%, Uniform 0%. Call-center: Summary 11%, Neighbors 0%, Document 40%, Corpus 42%, Uniform 7%.]
Figure 6: Maximum-likelihood weights for the various components of the relevance model p(q | s). Left: Weights
assigned to the constituent models from the Usenet FAQ
data. Right: Corresponding breakdown for the call-center
data. These weights were calculated using shrinkage.
it was drawn—is not so important.
From a set of rankings {r_1, r_2, ..., r_N}, one can measure the quality of a ranking algorithm using the harmonic mean rank:

    H \;\stackrel{\mathrm{def}}{=}\; \frac{N}{\sum_{i=1}^{N} 1/r_i}
A lower number indicates better performance; H = 1, which is optimal, means that the algorithm consistently assigns the first rank to the correct answer. Ta-
ble 1 shows the harmonic mean rank on the two col-
lections. The third column of Table 1 shows the result
of a QRS system using a uniform fidelity model, the
fourth corresponds to a standard tfidf-based ranking
method (Ponte, 1998), and the last column reflects the
performance of randomly guessing the correct sum-
mary from all answers in the document.
                   trial   # trials    LM    tfidf   random
Usenet FAQ data      1        554      1.41    2.29     4.20
                     2        549      1.38    2.42     4.25
                     3        535      1.40    2.30     4.19
Call-center data     1       1020      4.8    38.7      1335
                     2       1055      4.0    22.6      1335
                     3       1037      4.2    26.0      1321
Table 1: Performance of query-relevant extractive summa-
rization on the Usenet and call-center datasets. The numbers
reported in the three rightmost columns are harmonic mean
ranks: lower is better.
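For reference, the metric reported in Table 1 can be computed as below; this is a direct transcription of the harmonic-mean-rank formula, assuming 1-based ranks.

```python
def harmonic_mean_rank(ranks):
    """H = N / sum_i 1/r_i; H = 1.0 means the correct answer always ranked first."""
    return len(ranks) / sum(1.0 / r for r in ranks)

print(harmonic_mean_rank([1, 1, 2, 5]))   # 4 / 2.7 = 1.48...
```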
4 Extensions
4.1 Question-answering
The reader may by now have realized that our approach
to the QRS problem may be portable to the problem of
question-answering. By question-answering, we mean
a system which automatically extracts from a poten-
tially lengthy document (or set of documents) the an-
swer to a user-specified question. Devising a high-
quality question-answering system would be of great
service to anyone lacking the inclination to read an
entire user’s manual just to find the answer to a sin-
gle question. The success of the various automated
question-answering services on the Internet (such as
AskJeeves) underscores the commercial importance
of this task.
One can cast answer-finding as a traditional docu-
ment retrieval problem by considering each candidate
answer as an isolated document and ranking each can-
didate answer by relevance to the query. Traditional
tfidf-based ranking of answers will reward candidate
answers with many words in common with the query.
Employing traditional vector-space retrieval to find an-
swers seems attractive, since tfidf is a standard, time-
tested algorithm in the toolbox of any IR professional.
What this paper has described is a first step towards
more sophisticated models of question-answering.
First, we have dispensed with the simplifying assump-
tion that the candidate answers are independent of one
another by using a model which explicitly accounts
for the correlation between text blocks—candidate
answers—within a single document. Second, we have
put forward a principled statistical model for answer-
ranking: argmax_s p(s | d, q) has a direct probabilistic interpretation as the most likely answer to q within d.
Question-answering and query-relevant summariza-
tion are of course not one and the same. For one, the
criterion of containing an answer to a question is rather
stricter than mere relevance. Put another way, only a
small number of documents actually contain the an-
swer to a given query, while every document can in
principle be summarized with respect to that query.
Second, it would seem that the p(s | d) term, which
acts as a prior on summaries in (1), is less appropriate
in a question-answering setting, where it is less impor-
tant that a candidate answer to a query bears resem-
blance to the document containing it.
4.2 Generic summarization
Although this paper focuses on the task of query-
relevant summarization, the core ideas—formulating
a probabilistic model of the problem and learning
the values of this model automatically from FAQ-like
data—are equally applicable to generic summariza-
tion. In this case, one seeks the summary which best
typifies the document. Applying Bayes’ rule as in (1),
    s^* \;\stackrel{\mathrm{def}}{=}\; \arg\max_s \; p(s \mid d) \;=\; \arg\max_s \; \underbrace{p(d \mid s)}_{\text{generative}} \; \underbrace{p(s)}_{\text{prior}}          (6)
The first term on the right is a generative model of doc-
uments from summaries, and the second is a prior dis-
tribution over summaries. One can think of this factor-
ization in terms of a dialogue. Alice, a newspaper edi-
tor, has an idea s for a story, which she relates to Bob. Bob researches and writes the story d, which we can view as a "corruption" of Alice's original idea s. The task of generic summarization is to recover s, given only the generated document d, a model p(d | s) of how Bob generates documents from ideas, and a prior distribution p(s) on ideas s.
The central problem in information theory is reliable
communication through an unreliable channel. We can
interpret Alice’s idea a46 as the original signal, and the
process by which Bob turns this idea into a document
a44 as the channel, which corrupts the original message.
The summarizer’s task is to “decode” the original, con-
densed message from the document.
We point out this source-channel perspective be-
cause of the increasing influence that information the-
ory has exerted on language- and information-related
applications. For instance, the source-channel model
has been used for non-extractive summarization, gen-
erating titles automatically from news articles (Wit-
brock and Mittal, 1999).
The factorization in (6) is superficially similar to (1),
but there is an important difference: p(d | s) is generative, mapping a summary to a larger document, whereas p(q | s) is compressive, mapping a summary to a smaller query. This distinction is likely to translate in prac-
tice into quite different statistical models and training
procedures in the two cases.
5 Summary
The task of summarization is difficult to define and
even more difficult to automate. Historically, a re-
warding line of attack for automating language-related
problems has been to take a machine learning perspec-
tive: let a computer learn how to perform the task by
“watching” a human perform it many times. This is
the strategy we have pursued here.
There has been some work on learning a probabilis-
tic model of summarization from text; some of the ear-
liest work on this was due to Kupiec et al. (1995),
who used a collection of manually-summarized text
to learn the weights for a set of features used in a
generic summarization system. Hovy and Lin (1997)
present another system that learned how the position
of a sentence affects its suitability for inclusion in
a summary of the document. More recently, there
has been work on building more complex, structured
models—probabilistic syntax trees—to compress sin-
gle sentences (Knight and Marcu, 2000). Mani and
Bloedorn (1998) have recently proposed a method for
automatically constructing decision trees to predict
whether a sentence should or should not be included
in a document’s summary. These previous approaches
focus mainly on the generic summarization task, not
query relevant summarization.
The language modelling approach described here
does suffer from a common flaw within text processing
systems: the problem of synonymy. A candidate an-
swer containing the term Constantinople is likely
to be relevant to a question about Istanbul, but rec-
ognizing this correspondence requires a step beyond
word frequency histograms. Synonymy has received
much attention within the document retrieval com-
munity recently, and researchers have applied a vari-
ety of heuristic and statistical techniques—including
pseudo-relevance feedback and local context analy-
sis (Efthimiadis and Biron, 1994; Xu and Croft, 1996).
Some recent work in statistical IR has extended the ba-
sic language modelling approaches to account for word
synonymy (Berger and Lafferty, 1999).
This paper has proposed the use of two novel
datasets for summarization: the frequently-asked
questions (FAQs) from Usenet archives and ques-
tion/answer pairs from the call centers of retail compa-
nies. Clearly this data isn’t a perfect fit for the task of
building a QRS system: after all, answers are not sum-
maries. However, we believe that the FAQs represent a
reasonable source of query-related document conden-
sations. Furthermore, using FAQs allows us to assess
the effectiveness of applying standard statistical learn-
ing machinery—maximum-likelihood estimation, the
EM algorithm, and so on—to the QRS problem. More
importantly, it allows us to evaluate our results in a rig-
orous, non-heuristic way. Although this work is meant
as an opening salvo in the battle to conquer summa-
rization with quantitative, statistical weapons, we ex-
pect in the future to enlist linguistic, semantic, and
other non-statistical tools which have shown promise
in condensing text.
Acknowledgments
This research was supported in part by an IBM Univer-
sity Partnership Award and by Claritech Corporation.
The authors thank Right Now Tech for the use of the
call-center question database. We also acknowledge
thoughtful comments on this paper by Inderjeet Mani.
References
A. Berger and J. Lafferty. 1999. Information retrieval
as statistical translation. In Proc. of ACM SIGIR-99.
A. Berger, P. Brown, S. Della Pietra, V. Della Pietra,
J. Gillett, J. Lafferty, H. Printz, and L. Ures. 1994.
The CANDIDE system for machine translation. In
Proc. of the ARPA Human Language Technology
Workshop.
Y. Chali, S. Matwin, and S. Szpakowicz. 1999. Query-
biased text summarization as a question-answering
technique. In Proc. of the AAAI Fall Symp. on Ques-
tion Answering Systems, pages 52–56.
A. Dempster, N. Laird, and D. Rubin. 1977. Max-
imum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society,
39B:1–38.
E. Efthimiadis and P. Biron. 1994. UCLA-Okapi at
TREC-2: Query expansion experiments. In Proc. of
the Text Retrieval Conference (TREC-2).
E. Hovy and C. Lin. 1997. Automated text summa-
rization in SUMMARIST. In Proc. of the ACL Wkshp
on Intelligent Text Summarization, pages 18–24.
F. Jelinek. 1997. Statistical methods for speech recog-
nition. MIT Press.
K. Knight and D. Marcu. 2000. Statistics-based
summarization—Step one: Sentence compression.
In Proc. of AAAI-00. AAAI.
J. Kupiec, J. Pedersen, and F. Chen. 1995. A trainable
document summarizer. In Proc. SIGIR-95, pages
68–73, July.
Chin-Yew Lin. 1999. Training a selection function
for extraction. In Proc. of the Eighth ACM CIKM
Conference, Kansas City, MO.
I. Mani and E. Bloedorn. 1998. Machine learning of
generic and user-focused summarization. In Proc.
of AAAI-98, pages 821–826.
J. Ponte and W. Croft. 1998. A language modeling ap-
proach to information retrieval. In Proc. of SIGIR-
98, pages 275–281.
J. Ponte. 1998. A language modelling approach to
information retrieval. Ph.D. thesis, University of
Massachusetts at Amherst.
S. Sato and M. Sato. 1998. Rewriting saves extracted
summaries. In Proc. of the AAAI Intelligent Text
Summarization Workshop, pages 76–83.
C. Stein. 1955. Inadmissibility of the usual estimator
for the mean of a multivariate normal distribution.
In Proc. of the Third Berkeley symposium on mathe-
matical statistics and probability, pages 197–206.
F. Wancho. 1990. RFC 1153: Digest message format.
M. Witbrock and V. Mittal. 1999. Headline Genera-
tion: A framework for generating highly-condensed
non-extractive summaries. In Proc. of ACM SIGIR-
99, pages 315–316.
J. Xu and B. Croft. 1996. Query expansion using lo-
cal and global document analysis. In Proc. of ACM
SIGIR-96.
