Is It the Right Answer?
Exploiting Web Redundancy for Answer Validation
Bernardo Magnini, Matteo Negri, Roberto Prevete and Hristo Tanev
ITC-Irst, Centro per la Ricerca Scientifica e Tecnologica
[magnini,negri,prevete,tanev]@itc.it
Abstract
Answer Validation is an emerging topic
in Question Answering, where open do-
main systems are often required to rank
huge amounts of candidate answers. We
present a novel approach to answer valida-
tion based on the intuition that the amount
of implicit knowledge which connects an
answer to a question can be quantitatively
estimated by exploiting the redundancy of
Web information. Experiments carried out
on the TREC-2001 judged-answer collec-
tion show that the approach achieves a
high level of performance (i.e. 81% suc-
cess rate). The simplicity and the effi-
ciency of this approach make it suitable to
be used as a module in Question Answer-
ing systems.
1 Introduction
Open domain question-answering (QA) systems
search for answers to a natural language question
either on the Web or in a local document collec-
tion. Different techniques, varying from surface pat-
terns (Subbotin and Subbotin, 2001) to deep seman-
tic analysis (Zajac, 2001), are used to extract the text
fragments containing candidate answers. Several
systems apply answer validation techniques with the
goal of filtering out improper candidates by check-
ing how adequate a candidate answer is with re-
spect to a given question. These approaches rely
on discovering semantic relations between the ques-
tion and the answer. As an example, (Harabagiu
and Maiorano, 1999) describes answer validation as
an abductive inference process, where an answer is
valid with respect to a question if an explanation for
it, based on background knowledge, can be found.
Although theoretically well motivated, the use of se-
mantic techniques on open domain tasks is quite ex-
pensive both in terms of the involved linguistic re-
sources and in terms of computational complexity,
thus motivating research into alternative solutions
to the problem.
This paper presents a novel approach to answer
validation based on the intuition that the amount of
implicit knowledge which connects an answer to a
question can be quantitatively estimated by exploit-
ing the redundancy of Web information. The hy-
pothesis is that the number of documents that can
be retrieved from the Web in which the question and
the answer co-occur can be considered a significant
clue of the validity of the answer. Documents are
searched in the Web by means of validation pat-
terns, which are derived from a linguistic process-
ing of the question and the answer. In order to test
this idea a system for automatic answer validation
has been implemented and a number of experiments
have been carried out on questions and answers pro-
vided by the TREC-2001 participants. The advan-
tages of this approach are its simplicity on the one
hand and its efficiency on the other.
Automatic techniques for answer validation are
of great interest for the development of open do-
main QA systems. The availability of a completely
automatic evaluation procedure makes QA systems based on
generate-and-test approaches feasible. In this way, until a
given answer is automatically proved to be correct for a
question, the system will carry out different refinements of
its search criteria, checking the relevance of new candidate
answers. In addition, given that most QA systems rely on
complex architectures and the evaluation of their performance
requires a huge amount of work, the automatic assessment of
the relevance of an answer with respect to a given question
will speed up both algorithm refinement and testing.
Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, July 2002, pp. 425-432.
The paper is organized as follows. Section 2
presents the main features of the approach. Section 3
describes how validation patterns are extracted from
a question-answer pair by means of specific question
answering techniques. Section 4 explains the basic
algorithm for estimating the answer validity score.
Section 5 gives the results of a number of experiments
and discusses them. Finally, Section 6 puts
our approach in the context of related work.
2 Overall Methodology
Given a question $q$ and a candidate answer $a$, the answer
validation task is defined as the capability to assess the
relevance of $a$ with respect to $q$. We assume open domain
questions and that both answers and questions are texts
composed of few tokens (usually fewer than 100). This is
compatible with the TREC-2001 data, which will be used as
examples throughout this paper. We also assume the
availability of the Web, considered to be the largest open
domain text corpus, containing information about almost all
areas of human knowledge.
The intuition underlying our approach to answer validation is
that, given a question-answer pair ($[q, a]$), it is possible
to formulate a set of validation statements whose truthfulness
is equivalent to the degree of relevance of $a$ with respect
to $q$. For instance, given the question “What is the capital
of the USA?”, the problem of validating the answer
“Washington” is equivalent to estimating the truthfulness of
the validation statement “The capital of the USA is
Washington”. Therefore, the answer validation task could be
reformulated as a problem of statement reliability. There are
two issues to be addressed in order to make this intuition
effective. First, the idea of a validation statement is still
insufficient to capture the richness of implicit knowledge
that may connect an answer to a question: we will attack this
problem by defining the more flexible idea of a validation
pattern. Second, we have to design an effective and efficient
way to check the reliability of a validation pattern: our
solution relies on a procedure based on a statistical count of
Web searches.
Answers may occur in text passages with low similarity with
respect to the question. Passages telling facts may use
different syntactic constructions, are sometimes spread over
more than one sentence, may reflect opinions and personal
attitudes, and often use ellipsis and anaphora. For instance,
if the validation statement is “The capital of USA is
Washington”, there are Web documents containing passages like
those reported in Table 1, which cannot be found with a simple
search of the statement, but which nevertheless contain a
significant amount of knowledge about the relations between
the question and the answer. We will refer to these text
fragments as validation fragments.
1. Capital Region USA: Fly-Drive Holidays in
and Around Washington D.C.
2. the Insider’s Guide to the Capital Area Music
Scene (Washington D.C., USA).
3. The Capital Tangueros (Washington, DC
Area, USA)
4. I live in the Nation’s Capital, Washington
Metropolitan Area (USA).
5. in 1790 Capital (also USA’s capital): Wash-
ington D.C. Area: 179 square km
Table 1: Web search for validation fragments
A common feature in the above examples is the
co-occurrence of a certain subset of words (i.e.
“capital”,“USA” and “Washington”). We will make
use of validation patterns that cover a larger portion
of text fragments, including those lexically similar
to the question and the answer (e.g. fragments 4 and
5 in Table 1) and also those that are not similar (e.g.
fragment 2 in Table 1). In the case of our example
a set of validation statements can be generalized by
the validation pattern:
[capital <text> USA <text> Washington]
where <text> is a placeholder for any portion of
text with a fixed maximal length.
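As a rough illustration of how such a pattern behaves, the following Python sketch turns an ordered keyword list into a regular expression in which <text> is a gap of at most max_gap word tokens. The helper name pattern_to_regex, the token-based reading of "fixed maximal length" and the default of 10 are assumptions made for this sketch; in the approach itself the matching is delegated to the search engine.

```python
import re

def pattern_to_regex(keywords, max_gap=10):
    """Compile [k1 <text> k2 <text> ...] into a regex (hypothetical helper).

    <text> is modelled as at most `max_gap` intervening word tokens,
    and keywords must appear in the given order.
    """
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    body = gap.join(re.escape(k) for k in keywords)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pattern = pattern_to_regex(["capital", "USA", "Washington"])
# Fragment 1 from Table 1.
fragment = "Capital Region USA: Fly-Drive Holidays in and Around Washington D.C."
matched = pattern.search(fragment) is not None
```

With a much smaller gap (for example max_gap=2) the same fragment is no longer matched, which shows how the maximal length trades recall for precision.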
To check the correctness of $a$ with respect to $q$ we propose
a procedure that measures the number of occurrences on the Web
of a validation pattern derived from $a$ and $q$. A useful
feature of such patterns is that when we search for them on
the Web they usually produce many hits, thus making
statistical approaches applicable. In contrast, searching for
strict validation statements generally results in a small
number of documents (if any) and makes statistical methods
irrelevant. A number of techniques used for finding
collocations and co-occurrences of words, such as mutual
information, may well be used to measure the co-occurrence
tendency between the question and the candidate answer in the
Web. If we verify that such a tendency is statistically
significant, we may consider the validation pattern as
consistent and therefore we may assume a high level of
correlation between the question and the candidate answer.
Starting from the above considerations and given
a question-answer pair $\langle q, a \rangle$, we propose an
answer validation procedure based on the following steps:
1. Compute the sets of representative keywords $K_q$ and
$K_a$ from $q$ and from $a$ respectively; this step is
carried out using linguistic techniques, such as
answer type identification (from the question)
and named entities recognition (from the answer);
2. From the extracted keywords compute the validation
pattern for the pair $[q, a]$;
3. Submit the patterns to the Web and estimate an
answer validity score considering the number
of retrieved documents.
3 Extracting Validation Patterns
In our approach a validation pattern consists of two
components: a question sub-pattern (Qsp) and an
answer sub-pattern (Asp).
Building the Qsp. A Qsp is derived from the input
question cutting off non-content words with a stop-
words filter. The remaining words are expanded
with both synonyms and morphological forms in
order to maximize the recall of retrieved docu-
ments. Synonyms are automatically extracted from
the most frequent sense of the word in WordNet
(Fellbaum, 1998), which considerably reduces the
risk of adding disturbing elements. As for morphol-
ogy, verbs are expanded with all their tense forms
(i.e. present, present continuous, past tense and past
participle). Synonyms and morphological forms are
added to the Qsp and composed in an OR clause.
The following example illustrates how the Qsp
is constructed. Given the TREC-2001 question
“When did Elvis Presley die?”, the stop-words filter
removes “When” and “did” from the input. Then
synonyms of the first sense of “die” (i.e. “decease”,
“perish”, etc.) are extracted from WordNet. Finally,
morphological forms for all the corresponding verb
tenses are added to the Qsp. The resultant Qsp will
be the following:
[Elvis <text> Presley <text> (die OR died OR
dying OR perish OR ...)]
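The Qsp construction just described can be sketched in Python as follows. This is only a minimal illustration: the stop-word list, the synonym table and the verb-form table are small hand-written stand-ins (assumptions) for the stop-words filter, the WordNet first-sense lookup and the morphological generator, and build_qsp is a hypothetical name.

```python
import re

# Toy stand-ins (assumptions): a real system would use a full stop-word
# list, WordNet first-sense synonyms and a morphological generator.
STOP_WORDS = {"when", "did", "what", "which", "is", "the", "a", "an", "of", "in"}
SYNONYMS = {"die": ["decease", "perish"]}
VERB_FORMS = {
    "die": ["die", "dies", "dying", "died"],
    "decease": ["decease", "deceases", "deceasing", "deceased"],
    "perish": ["perish", "perishes", "perishing", "perished"],
}

def build_qsp(question):
    """Return a question sub-pattern string for `question`."""
    tokens = re.findall(r"\w+", question)
    content = [t for t in tokens if t.lower() not in STOP_WORDS]
    parts = []
    for token in content:
        # Expand the word with its synonyms, then each lemma with its forms.
        lemmas = [token] + SYNONYMS.get(token.lower(), [])
        variants = []
        for lemma in lemmas:
            for form in VERB_FORMS.get(lemma.lower(), [lemma]):
                if form not in variants:
                    variants.append(form)
        if len(variants) == 1:
            parts.append(variants[0])
        else:
            parts.append("(" + " OR ".join(variants) + ")")
    return "[" + " <text> ".join(parts) + "]"

qsp = build_qsp("When did Elvis Presley die?")
```

Here qsp keeps "Elvis" and "Presley" unchanged and replaces "die" with an OR clause over its forms and the forms of its synonyms.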
Building the Asp. An Asp is constructed in two
steps. First, the answer type of the question is iden-
tified considering both morpho-syntactic (a part of
speech tagger is used to process the question) and
semantic features (by means of semantic predicates
defined on the WordNet taxonomy; see (Magnini et
al., 2001) for details). Possible answer types are:
DATE, MEASURE, PERSON, LOCATION, ORGANI-
ZATION, DEFINITION and GENERIC. DEFINITION
is the answer type peculiar to questions like “What
is an atom?” which represent a considerable part
(around 25%) of the TREC-2001 corpus. The answer
type GENERIC is used for non-definition questions
asking for entities that cannot be classified as
named entities (e.g. the questions: “Material called
linen is made from what plant?” or “What mineral
helps prevent osteoporosis?”).
In the second step, a rule-based named entities
recognition module identifies in the answer string
all the named entities matching the answer type cat-
egory. If the category corresponds to a named en-
tity, an Asp for each selected named entity is cre-
ated. If the answer type category is either DEFINI-
TION or GENERIC, the entire answer string except
the stop-words is considered. In addition, in order
to maximize the recall of retrieved documents, the
Asp is expanded with verb tenses. The following
example shows how the Asp is created. Given the
TREC question “When did Elvis Presley die?” and
the candidate answer “though died in 1977 of course
some fans maintain”, since the answer type category
is DATE the named entities recognition module will
select [1977] as an answer sub-pattern.
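A minimal Python sketch of the second step is given below. It assumes a crude regular expression as a stand-in for the rule-based named entities recognition module and only covers the DATE, DEFINITION and GENERIC cases; the function name build_asps and the year pattern are illustrative assumptions, not the module used in the system.

```python
import re

# Crude DATE recogniser (assumption): four-digit years 1000-2099.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")

def build_asps(answer, answer_type, stop_words=frozenset()):
    """Return the list of answer sub-patterns for a candidate answer."""
    if answer_type == "DATE":
        # One Asp per recognised entity of the expected type.
        return YEAR.findall(answer)
    if answer_type in ("DEFINITION", "GENERIC"):
        # The entire answer string except the stop-words.
        words = [w for w in re.findall(r"\w+", answer)
                 if w.lower() not in stop_words]
        return [" ".join(words)]
    raise NotImplementedError("no recogniser sketched for " + answer_type)

asps = build_asps("though died in 1977 of course some fans maintain", "DATE")
```

For the Elvis Presley example this yields the single sub-pattern 1977.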
4 Estimating Answer Validity
The answer validation algorithm queries the Web
with the patterns created from the question and an-
swer and after that estimates the consistency of the
patterns.
4.1 Querying the Web
We use a Web-mining algorithm that considers the number of
pages retrieved by the search engine. In contrast,
qualitative approaches to Web mining (e.g. (Brill et al.,
2001)) analyze the document content and, as a result, consider
only a relatively small number of pages. For information
retrieval we used the AltaVista search engine. Its advanced
syntax allows the use of operators that implement the idea of
validation patterns introduced in Section 2. Queries are
composed using the NEAR, OR and AND boolean operators. The
NEAR operator searches for pages where two words appear within
a distance of no more than 10 tokens: it is used to put
together the question and the answer sub-patterns in a single
validation pattern. The OR operator introduces variations in
the word order and verb forms. Finally, the AND operator is
used as an alternative to NEAR, allowing more distance among
pattern elements.
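The composition of such queries can be sketched in Python as below. The query strings follow the bracketed patterns shown in this paper; the exact syntax accepted by the search engine, the quoting of multi-word terms and the function names are assumptions of this sketch.

```python
def or_group(variants):
    """Join alternative forms of one keyword in an OR clause."""
    # Quoting multi-word terms as phrases is an assumption of this sketch.
    quoted = ['"%s"' % v if " " in v else v for v in variants]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"

def compose_query(qsp, asp=None, operator="NEAR"):
    """Join keyword groups of the Qsp (and optionally the Asp) with NEAR or AND."""
    groups = list(qsp) + ([asp] if asp else [])
    return (" %s " % operator).join(or_group(g) for g in groups)

qsp = [["US"], ["Big"], ["Muddy"]]
near_query = compose_query(qsp, ["Mississippi River"])
and_query = compose_query(qsp, ["Mississippi River"], operator="AND")
```

Calling compose_query without an Asp gives the question sub-pattern query alone, which is the query used when deciding whether the pattern needs to be relaxed.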
If the question sub-pattern Qsp does not return any document
or returns fewer documents than a certain threshold
(experimentally set to 7), the question pattern is relaxed by
cutting one word; in this way a new query is formulated and
submitted to the search engine. This is repeated until no more
words can be cut or the number of returned documents becomes
higher than the threshold. Pattern relaxation is performed
using word-ignoring rules in a specified order. Such rules,
for instance, ignore the focus of the question, because it is
unlikely that it occurs in a validation fragment; ignore
adverbs and adjectives, because they are less significant;
ignore nouns belonging to the WordNet classes “abstraction”,
“psychological feature” or “group”, because they usually
specify finer details and human attitudes. Names, numbers
and measures are preferred over all the lower-case
words and are cut last.
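The relaxation loop can be sketched in Python as follows. The word-ignoring rules are reduced here to a precomputed drop_order list, and the search engine is replaced by a lookup table whose counts (0, 0 and 28) mirror the "Big Muddy" example of Section 4.3; both simplifications, and all names, are assumptions of this sketch.

```python
def relax_pattern(keywords, drop_order, count_hits, threshold=7):
    """Cut keywords one at a time until the pattern returns enough pages.

    `drop_order` lists keywords in the order in which the word-ignoring
    rules would cut them; `count_hits` maps a keyword tuple to a page count.
    """
    current = list(keywords)
    pending = [k for k in drop_order if k in current]
    hits = count_hits(tuple(current))
    # Relax while the count is below the threshold and words can still be cut.
    while hits < threshold and pending and len(current) > 1:
        current.remove(pending.pop(0))
        hits = count_hits(tuple(current))
    return current, hits

# Stand-in for the search engine (assumption): unknown patterns return 0 pages.
FAKE_COUNTS = {("US", "Big", "Muddy"): 28}

def fake_count_hits(keywords):
    return FAKE_COUNTS.get(keywords, 0)

keywords = ["river", "US", "known", "Big", "Muddy"]
drop_order = ["river", "known", "US", "Big", "Muddy"]
relaxed, hits = relax_pattern(keywords, drop_order, fake_count_hits)
```

The loop stops as soon as a pattern reaches the threshold, so names such as "Big" and "Muddy", which come last in drop_order, are kept whenever possible.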
4.2 Estimating pattern consistency
The Web-mining module submits three searches to
the search engine: the sub-patterns [Qsp] and [Asp]
and the validation pattern [QAp], this last built as
the composition [Qsp NEAR Asp]. The search engine
returns respectively: hits(Qsp), hits(Asp)
and hits(Qsp NEAR Asp). The probability P(A)
of a pattern A in the Web is calculated by:
$$P(A) = \frac{\mathrm{hits}(A)}{\mathrm{MaxPages}}$$
where hits(A) is the number of pages in the Web
where A appears and MaxPages is the maximum
number of pages that can be returned by the search
engine. We set this constant experimentally. However,
in two of the formulas we use (i.e. Pointwise
Mutual Information and Corrected Conditional
Probability) MaxPages may be ignored.
The joint probability P(Qsp,Asp) is calculated by
means of the validation pattern probability:
$$P(\mathrm{QAp}) = P(\mathrm{Qsp}\ \mathrm{NEAR}\ \mathrm{Asp})$$
We have tested three alternative measures to es-
timate the degree of relevance of Web searches:
Pointwise Mutual Information, Maximal Likelihood
Ratio and Corrected Conditional Probability, a vari-
ant of Conditional Probability which considers the
asymmetry of the question-answer relation. Each
measure provides an answer validity score: high val-
ues are interpreted as strong evidence that the vali-
dation pattern is consistent. This is a clue to the fact
that the Web pages where this pattern appears con-
tain validation fragments, which imply answer accu-
racy.
Pointwise Mutual Information (PMI) (Manning
and Schütze, 1999) has been widely used to find
co-occurrence in large corpora.
$$\mathrm{PMI}(\mathrm{Qsp}, \mathrm{Asp}) = \frac{P(\mathrm{Qsp}, \mathrm{Asp})}{P(\mathrm{Qsp}) \times P(\mathrm{Asp})}$$
PMI(Qsp,Asp) is used as a clue to the internal
coherence of the question-answer validation pattern
QAp. Substituting the probabilities in the PMI formula
with the previously introduced Web statistics,
we obtain:
$$\frac{\mathrm{hits}(\mathrm{Qsp}\ \mathrm{NEAR}\ \mathrm{Asp})}{\mathrm{hits}(\mathrm{Qsp}) \times \mathrm{hits}(\mathrm{Asp})} \times \mathrm{MaxPages}$$
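The Web-statistics form of PMI can be computed directly from the three page counts. The Python sketch below uses hypothetical counts and function names; returning 0.0 when a sub-pattern has no hits is an assumption of the sketch, since the formula is undefined in that case.

```python
def probability(hits, max_pages):
    """P(A) = hits(A) / MaxPages."""
    return hits / max_pages

def pmi(hits_qa, hits_q, hits_a, max_pages):
    """PMI score from hits(Qsp NEAR Asp), hits(Qsp), hits(Asp) and MaxPages."""
    if hits_q == 0 or hits_a == 0:
        return 0.0  # assumption: treat an unseen sub-pattern as no evidence
    return hits_qa / (hits_q * hits_a) * max_pages

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
score = pmi(hits_qa=5, hits_q=10, hits_a=100, max_pages=1000)
from_probabilities = probability(5, 1000) / (
    probability(10, 1000) * probability(100, 1000))
```

Both score and from_probabilities evaluate to 5 (up to floating-point rounding), which confirms that the hit-count form is the probability form with MaxPages factored out.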
Maximal Likelihood Ratio (MLHR) is also used
for word co-occurrence mining (Dunning, 1993).
We decided to check MLHR for answer validation
because it is supposed to outperform PMI in the case
of sparse data, a situation that may arise for
questions with complex patterns that return a small
number of hits.
$$\mathrm{MLHR}(\mathrm{Qsp}, \mathrm{Asp}) = -2 \log \lambda$$
$$\lambda = \frac{L(p, k_1, n_1)\, L(p, k_2, n_2)}{L(p_1, k_1, n_1)\, L(p_2, k_2, n_2)}$$
where
$$L(p, k, n) = p^{k} (1 - p)^{n - k}$$
$$p_1 = \frac{k_1}{n_1}, \quad p_2 = \frac{k_2}{n_2}, \quad p = \frac{k_1 + k_2}{n_1 + n_2}$$
$$k_1 = \mathrm{hits}(\mathrm{Qsp}, \mathrm{Asp}), \quad k_2 = \mathrm{hits}(\mathrm{Qsp}, \neg \mathrm{Asp})$$
$$n_1 = \mathrm{hits}(\mathrm{Asp}), \quad n_2 = \mathrm{hits}(\neg \mathrm{Asp})$$
Here hits(Qsp, ¬Asp) is the number of
appearances of Qsp when Asp is not present and
it is calculated as hits(Qsp) − hits(Qsp NEAR Asp).
Similarly, hits(¬Asp) is the number of Web
pages where Asp does not appear and it is calculated
as MaxPages − hits(Asp).
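A possible Python rendering of the MLHR score is sketched below. It works with logarithms of L(p, k, n) to avoid underflow, and it takes hits(Qsp, Asp) to be hits(Qsp NEAR Asp); that reading, the function names and the example counts are assumptions of this sketch. It requires 0 < hits(Asp) < MaxPages.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_l(p, k, n):
    """log L(p, k, n) = k log p + (n - k) log(1 - p)."""
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def mlhr(hits_qa, hits_q, hits_a, max_pages):
    """-2 log(lambda) computed from Web page counts."""
    k1, k2 = hits_qa, hits_q - hits_qa   # Qsp with Asp, Qsp without Asp
    n1, n2 = hits_a, max_pages - hits_a  # pages with Asp, pages without Asp
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    log_lambda = (log_l(p, k1, n1) + log_l(p, k2, n2)
                  - log_l(p1, k1, n1) - log_l(p2, k2, n2))
    return -2.0 * log_lambda

# Hypothetical counts: Qsp is equally frequent with and without Asp ...
independent = mlhr(hits_qa=10, hits_q=110, hits_a=100, max_pages=1100)
# ... versus Qsp being much more frequent on pages that contain Asp.
associated = mlhr(hits_qa=50, hits_q=110, hits_a=100, max_pages=1100)
```

In the first case the two conditional rates coincide and the score is zero; in the second the score is clearly positive.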
Corrected Conditional Probability (CCP). In
contrast with PMI and MLHR, CCP is not
symmetric (e.g. generally CCP(Qsp, Asp) ≠
CCP(Asp, Qsp)). This is based on the fact that
we search for the occurrence of the answer pattern
Asp only in the cases when Qsp is present. The
statistical evidence for this can be measured through
P(Asp|Qsp); however, this value is corrected with
$P(\mathrm{Asp})^{2/3}$ in the denominator, to avoid the cases
when high-frequency words and patterns are taken
as relevant answers.
$$\mathrm{CCP}(\mathrm{Qsp}, \mathrm{Asp}) = \frac{P(\mathrm{Asp} \mid \mathrm{Qsp})}{P(\mathrm{Asp})^{2/3}}$$
For CCP we obtain:
$$\frac{\mathrm{hits}(\mathrm{Qsp}\ \mathrm{NEAR}\ \mathrm{Asp})}{\mathrm{hits}(\mathrm{Qsp}) \times \mathrm{hits}(\mathrm{Asp})^{2/3}} \times \mathrm{MaxPages}^{2/3}$$
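The hit-count form of CCP, and its asymmetry, can be checked with a few lines of Python. The counts are hypothetical and the function name is an assumption of this sketch.

```python
def ccp(hits_qa, hits_q, hits_a, max_pages):
    """CCP(Qsp, Asp) from hits(Qsp NEAR Asp), hits(Qsp), hits(Asp), MaxPages."""
    return hits_qa / (hits_q * hits_a ** (2 / 3)) * max_pages ** (2 / 3)

# Hypothetical counts: hits(Qsp) = 10, hits(Asp) = 8, 5 pages with both.
forward = ccp(hits_qa=5, hits_q=10, hits_a=8, max_pages=1000)
# Swapping the roles of the two sub-patterns gives CCP(Asp, Qsp).
backward = ccp(hits_qa=5, hits_q=8, hits_a=10, max_pages=1000)
```

Here forward is 5 / (10 × 4) × 100 = 12.5, while backward is roughly 13.5, so the two directions give different scores, unlike PMI.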
4.3 An example
Consider an example taken from the question an-
swer corpus of the main task of TREC-2001:
“Which river in US is known as Big Muddy?”. The
question keywords are: “river”, “US”, “known”,
“Big”, “Muddy”. The search of the pattern [river
NEAR US NEAR (known OR know OR...) NEAR Big
NEAR Muddy] returns 0 pages, so the algorithm re-
laxes the pattern by cutting the initial noun “river”,
according to the heuristic for discarding a noun if it
is the first keyword of the question. The second pat-
tern [US NEAR (known OR know OR...) NEAR Big
NEAR Muddy] also returns 0 pages, so we apply the
heuristic for ignoring verbs like “know”, “call” and
abstract nouns like “name”. The third pattern [US
NEAR Big NEAR Muddy] returns 28 pages, which is
over the experimentally set threshold of seven pages.
One of the 50-byte candidate answers from the
TREC-2001 answer collection is “recover Mississippi
River”. Taking into account the answer type
LOCATION, the algorithm considers only the named
entity: “Mississippi River”. To calculate the answer
validity score (in this example PMI) for [Mississippi
River], the procedure constructs the validation
pattern [US NEAR Big NEAR Muddy NEAR Mississippi
River] with the answer sub-pattern [Mississippi
River]. These two patterns are passed to the
search engine, and the returned numbers of pages
are substituted in the mutual information expression
at the places of hits(Qsp NEAR Asp) and hits(Asp)
respectively; the previously obtained number (i.e.
28) is substituted at the place of hits(Qsp). In this
way an answer validity score of 55.5 is calculated.
It turns out that this value is the maximal validity
score for all the answers of this question. Other
correct answers from the TREC-2001 collection
contain as named entity “Mississippi”. Their answer
validity score is 11.8, which is greater than 1.2 and
also greater than 0.2 × MaximalValidityScore
(= 11.1). This score (i.e. 11.8) classifies them as
relevant answers. On the other hand, all the wrong
answers have a validity score below 1 and as a result
all of them are classified as irrelevant answer
candidates.
5 Experiments and Discussion
A number of experiments have been carried out in
order to check the validity of the proposed answer
validation technique. As a data set, the 492 ques-
tions of the TREC-2001 database have been used.
For each question, at most three correct answers and
three wrong answers have been randomly selected
from the TREC-2001 participants’ submissions,
resulting in a corpus of 2726 question-answer pairs
(some questions have fewer than three positive answers
in the corpus). As said before, AltaVista was used as
the search engine.
A baseline for the answer validation experiment
was defined by considering how often an answer oc-
curs in the top 10 documents among those (1000
for each question) provided by NIST to TREC-2001
participants. An answer was judged correct for a
question if it appeared at least once in the first
10 documents retrieved for that question; otherwise
it was judged not correct. Baseline results are
reported in Table 2.
We carried out several experiments in order to
check a number of working hypotheses. Three in-
dependent factors were considered:
Estimation method. We have implemented three
measures (reported in Section 4.2) to estimate an an-
swer validity score: PMI, MLHR and CCP.
Threshold. We wanted to estimate the role of two
different kinds of thresholds for the assessment of
answer validation. In the case of an absolute
threshold, if the answer validity score for a candidate
answer is below the threshold, the answer is considered
wrong, otherwise it is accepted as relevant. In a
second type of experiment, for every question and its
corresponding answers the program chooses the
answer with the highest validity score and calculates a
relative threshold on that basis (i.e.
$\mathrm{threshold} = k \times \mathrm{MaxValidityScore}$).
However, the relative threshold should be larger than
a certain minimum value.
Question type. We wanted to check performance
variation based on different types of TREC-2001
questions. In particular, we have separated defini-
tion and generic questions from true named entities
questions.
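The two threshold schemes described above can be sketched in Python as follows. The values k = 0.2 and 1.2 are taken from the worked example in Section 4.3; using 1.2 as the minimum value of the relative threshold, the two scores below 1 and the function names are assumptions of this sketch.

```python
def accept_absolute(scores, threshold):
    """Accept every answer whose validity score is not below the threshold."""
    return [s >= threshold for s in scores]

def accept_relative(scores, k=0.2, min_threshold=1.2):
    """Accept answers scoring at least k times the best score for the question.

    The relative threshold is never allowed to fall below `min_threshold`.
    """
    if not scores:
        return []
    threshold = max(k * max(scores), min_threshold)
    return [s >= threshold for s in scores]

# 55.5 and 11.8 come from the Section 4.3 example; 0.9 and 0.4 are
# hypothetical scores standing in for the wrong answers (all below 1).
scores = [55.5, 11.8, 0.9, 0.4]
relative = accept_relative(scores)
absolute = accept_absolute(scores, threshold=1.2)
```

For this question both schemes accept the first two answers and reject the other two; they differ when the best score of a question is very high or very low.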
Tables 2 and 3 report the results of the automatic
answer validation experiments obtained respectively
on all the TREC-2001 questions and on the subset
of named entity questions. For each estimation
method we report precision, recall and success
rate. Success rate best represents the performance
of the system, being the percent of $[q, a]$ pairs
where the result given by the system is the same as
the TREC judges’ opinion. Precision is the percent
of $\langle q, a \rangle$ pairs estimated by the algorithm as
relevant, for which the opinion of TREC judges was the
same. Recall shows the percent of the relevant
answers which the system also evaluates as relevant.
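The three measures defined above can be computed from system decisions and judge decisions as in the following Python sketch. The function name and the five example pairs are hypothetical.

```python
def evaluate(predicted, gold):
    """Precision, recall and success rate (all in percent).

    `predicted[i]` is True if the system accepts pair i as relevant,
    `gold[i]` is True if the human judges marked pair i as correct.
    """
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)

    def pct(num, den):
        return 100.0 * num / den if den else 0.0

    return {
        "precision": pct(tp, tp + fp),
        "recall": pct(tp, tp + fn),
        "success_rate": pct(tp + tn, len(pairs)),
    }

# Hypothetical decisions for five question-answer pairs.
result = evaluate([True, True, False, False, True],
                  [True, False, False, True, True])
```

In this toy case two of the three accepted pairs are correct and two of the three correct pairs are accepted, so precision and recall are both about 66.7%, while the system agrees with the judges on three pairs out of five (success rate 60%).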
Method         P (%)    R (%)    SR (%)
Baseline       50.86     4.49    52.99
CCP - rel.     77.85    82.60    81.25
CCP - abs.     74.12    81.31    78.42
PMI - rel.     77.40    78.27    79.56
PMI - abs.     70.95    87.17    77.79
MLHR - rel.    81.23    72.40    79.60
MLHR - abs.    72.80    80.80    77.40
Table 2: Results on all 492 TREC-2001 questions

Method         P (%)    R (%)    SR (%)
CCP - rel.     85.12    84.27    86.38
CCP - abs.     83.07    78.81    83.35
PMI - rel.     83.78    82.12    84.90
PMI - abs.     79.56    84.44    83.35
MLHR - rel.    90.65    72.75    84.44
MLHR - abs.    87.20    67.20    82.10
Table 3: Results on 249 named entity questions
The best results on the 492-question corpus (CCP
measure with relative threshold) show a success rate
of 81.25%, i.e. in 81.25% of the pairs the system
evaluation corresponds to the human evaluation, and
confirm the initial working hypotheses. This is 28%
above the baseline success rate. Precision and
recall are respectively 20-30% and 68-87% above the
baseline values. These results demonstrate that the
intuition behind the approach is well motivated and that
the algorithm provides a workable solution for
answer validation.
The experiments show that the average difference
between the success rates obtained for the named
entity questions (Table 3) and the full TREC-2001
question set (Table 2) is 5.1%. This means that our
approach performs better when the answer entities
are well specified.
Another conclusion is that the relative threshold
demonstrates superiority over the absolute threshold
in both test sets (average 2.3%). However if the per-
cent of the right answers in the answer set is lower,
then the efficiency of this approach may decrease.
The best results in both question sets are obtained
by applying CCP. Such non-symmetric formulas
might turn out to be more applicable in general.
As Corrected Conditional Probability (CCP) is not a
classical co-occurrence measure like PMI and MLHR,
we may consider its high performance as evidence
of the difference between our task and classic
co-occurrence mining. Another indication of this is the
fact that the MLHR and PMI performances are comparable,
whereas in the case of classic co-occurrence
search MLHR should show a much better success
rate. It seems that we have to develop other measures
specific to question-answer co-occurrence
mining.
6 Related Work
Although there is some recent work addressing the
evaluation of QA systems, it seems that the idea of
using a fully automatic approach to answer validation
has not yet been explored. For instance, the
approach presented in (Breck et al., 2000) is
semi-automatic. The proposed methodology for answer
validation relies on computing the overlap between
the system response to a question and the
stemmed content words of an answer key. All the
answer keys corresponding to the 198 TREC-8 questions
have been manually constructed by human annotators
using the TREC corpus and external resources
like the Web.
The idea of using the Web as a corpus is an
emerging topic of interest in the computational
linguistics community. The TREC-2001 QA track
demonstrated that Web redundancy can be exploited
at different levels in the process of finding answers
to natural language questions. Several studies (e.g.
(Clarke et al., 2001) (Brill et al., 2001)) suggest that
the application of Web search can improve the preci-
sion of a QA system by 25-30%. A common feature
of these approaches is the use of the Web to intro-
duce data redundancy for a more reliable answer ex-
traction from local text collections. (Radev et al.,
2001) suggests a probabilistic algorithm that learns
the best query paraphrase of a question searching the
Web. Other approaches suggest training a question-
answering system on the Web (Mann, 2001).
The Web-mining algorithm presented in this pa-
per is similar to the PMI-IR (Pointwise Mutual
Information - Information Retrieval) described in
(Turney, 2001). Turney uses PMI and Web retrieval
to decide which word in a list of candidates is the
best synonym with respect to a target word. However,
the answer validity task has its own peculiarities.
We investigate how the occurrence of the
question words influences the appearance of the answer
words. Therefore, we introduce additional linguistic
techniques for pattern and query formulation,
such as keyword extraction, answer type extraction,
named entities recognition and pattern relaxation.
7 Conclusion and Future Work
We have presented a novel approach to answer val-
idation based on the intuition that the amount of
implicit knowledge which connects an answer to a
question can be quantitatively estimated by exploit-
ing the redundancy of Web information. Results ob-
tained on the TREC-2001 QA corpus correlate well
with the human assessment of answers’ correctness
and confirm that a Web-based algorithm provides a
workable solution for answer validation.
Several activities are planned in the near future.
First, the approach we presented is currently
based on fixed validation patterns that combine single
words extracted both from the question and from
the answer. These word-level patterns provide
broad coverage (i.e. many documents are typically
retrieved) at the cost of low precision (i.e. weak
correlations among the keywords are also captured). To
increase the precision we want to experiment with other
types of patterns, which combine words into larger
units (e.g. phrases or whole sentences). We believe
that the answer validation process can be improved
by considering both pattern variations (from word-level
to phrase and sentence-level) and the trade-off
between the precision of the search pattern and the
number of retrieved documents. Preliminary
experiments confirm the validity of this hypothesis.
Then, a generate-and-test module based on the
validation algorithm presented in this paper will be
integrated in the architecture of our QA system under
development. In order to exploit the efficiency and
the reliability of the algorithm, such a system will be
designed to maximize the recall of retrieved
candidate answers. Instead of performing a deep
linguistic analysis of these passages, the system will
delegate to the evaluation component the selection
of the right answer.
References
E.J. Breck, J.D. Burger, L. Ferro, L. Hirschman, D. House, M. Light,
and I. Mani. 2000. How to Evaluate Your Question Answering System
Every Day and Still Get Real Work Done. In Proceedings of LREC-2000,
pages 1495–1500, Athens, Greece, 31 May - 2 June.
E. Brill, J. Lin, M. Banko, S. Dumais, and A. Ng. 2001. Data-Intensive
Question Answering. In TREC-10 Notebook Papers, Gaithersburg, MD.
C. Clarke, G. Cormack, T. Lynam, C. Li, and G. McLearn. 2001. Web
Reinforced Question Answering (MultiText Experiments for TREC 2001).
In TREC-10 Notebook Papers, Gaithersburg, MD.
T. Dunning. 1993. Accurate Methods for the Statistics of Surprise and
Coincidence. Computational Linguistics, 19(1):61–74.
C. Fellbaum. 1998. WordNet, An Electronic Lexical Database. The MIT
Press.
S. Harabagiu and S. Maiorano. 1999. Finding Answers in Large
Collections of Texts: Paragraph Indexing + Abductive Inference. In
Proceedings of the AAAI Fall Symposium on Question Answering Systems,
pages 63–71, November.
B. Magnini, M. Negri, R. Prevete, and H. Tanev. 2001. Multilingual
Question/Answering: the DIOGENE System. In TREC-10 Notebook Papers,
Gaithersburg, MD.
G. S. Mann. 2001. A Statistical Method for Short Answer Extraction. In
Proceedings of the ACL-2001 Workshop on Open-Domain Question
Answering, Toulouse, France, July.
C.D. Manning and H. Schütze. 1999. Foundations of Statistical Natural
Language Processing. The MIT Press, Cambridge, Massachusetts.
H. R. Radev, H. Qi, Z. Zheng, S. Blair-Goldensohn, Z. Zhang, W. Fan,
and J. Prager. 2001. Mining the Web for Answers to Natural Language
Questions. In Proceedings of 2001 ACM CIKM, Atlanta, Georgia, USA,
November.
M. Subbotin and S. Subbotin. 2001. Patterns of Potential Answer
Expressions as Clues to the Right Answers. In TREC-10 Notebook Papers,
Gaithersburg, MD.
P.D. Turney. 2001. Mining the Web for Synonyms: PMI-IR versus LSA on
TOEFL. In Proceedings of ECML2001, pages 491–502, Freiburg, Germany.
R. Zajac. 2001. Towards Ontological Question Answering. In Proceedings
of the ACL-2001 Workshop on Open-Domain Question Answering, Toulouse,
France, July.
