Question answering via Bayesian inference on lexical relations
Ganesh Ramakrishnan, Apurva Jadhav, Ashutosh Joshi, Soumen Chakrabarti, Pushpak Bhattacharyya
{hare,apurvaj,ashuj,soumen,pb}@cse.iitb.ac.in
Dept. of Computer Science and Engg.,
Indian Institute of Technology, Mumbai, India
Abstract
Many researchers have used lexical networks
and ontologies to mitigate synonymy and polysemy
problems in Question Answering (QA) systems,
coupling them with taggers, query classifiers, and answer
extractors in complex and ad-hoc ways. We seek
to make QA systems reproducible with shared and
modest human effort, carefully separating knowl-
edge from algorithms. To this end, we propose
an aesthetically “clean” Bayesian inference scheme
for exploiting lexical relations for passage-scoring
for QA. The factors that contribute to the effi-
cacy of Bayesian inference on lexical relations are
soft word sense disambiguation, parameter smooth-
ing, which ameliorates the data sparsity problem, and
estimation of joint probabilities over words, which
overcomes the deficiency of naive-Bayes-like ap-
proaches. Our system is superior to vector-space
ranking techniques from IR, and its accuracy ap-
proaches that of the top contenders at the TREC QA
tasks in recent years.
1 Introduction
This paper describes an approach to probabilistic in-
ference using lexical relations, such as expressed by
a WordNet, an ontology, or a combination, with ap-
plications to passage-scoring for open-domain ques-
tion answering (QA).
The use of lexical resources in Information Re-
trieval (IR) is not new; for almost a decade, the
IR community has considered the use of natural
language processing techniques (Lewis and Jones,
1996) to circumvent synonymy, polysemy, and other
barriers to purely string-matching search engines. In
particular, a number of researchers have attempted
to use the English WordNet to “bridge the gap” be-
tween query and response. Interestingly, the results
have mostly been inconclusive or negative (Fell-
baum, 1998a). A number of explanations have been
offered for this lack of success, some of which are:
• presence of unnecessary links and absence of
necessary links in WordNet (Fellbaum, 1998b),
• the hurdle of Word Sense Disambiguation (WSD)
(Sanderson, 1994),
• ad-hocness in the distance and scoring func-
tions (Abe et al., 1996).
1.1 Question answering (QA)
Unlike IR systems which return a list of documents
in response to a query, from which the user must
extract the answer manually, the goal of QA is to
extract from the corpus direct answers to questions
posed in a natural language.
An important step before answer extraction is
to identify and rate candidate passages from docu-
ments which might contain the answer. The notion
of a passage is somewhat arbitrary: various notions
of a passage have emerged (Vorhees, 2000). For our
purposes, a passage comprises a few consecutive sen-
tences, or a fixed number of consecutive words.
In contrast to IR, where linguistic resources have
not been found very useful, QA has always de-
pended on a mixture of stock lexical networks and
custom ontologies (language-independent concep-
tual hierarchies) crafted through human understand-
ing of the task at hand (Harabagiu et al., 2000;
Clarke et al., 2001). Ontologies, hand-crafted and
customized, sometimes from the WordNet itself, are
employed for question type classification, relation-
ships between places, measures, etc.
The scoring (and thereby, ranking) of passages
through lexical networks or ontologies is more suc-
cessful in QA than in classic IR because of the na-
ture of the QA task. Passage-scoring in QA benefits
from indirect matches through an ontology.
By separating the passage-scoring algorithm from
the knowledge base, we can keep improving our sys-
tem by continually upgrading the lexical relations in
the knowledge base and retraining our inference al-
gorithm.
Map: §2 describes related work. §3 gives the
motivation behind our approach and the background
information (WordNet and Bayesian inference).
§4 describes our QA system. Results are presented
in §5, and concluding remarks are made in §6.
2 Related work
Information Retrieval (IR) systems such as
SMART (Buckley, 1985) rank documents for
relevance w.r.t. a user query, based on keyword
match between the query and a document, each rep-
resented in the well-known “vector space model”.
The degree of match is measured as the cosine of
the angle between query and document vectors.
In QA, an IR subsystem is typically used to short-
list passages which are likely to embed the answer.
Usually, several enhancements are made to stock IR
systems to meet this task.
First, the cosine measure used in stock vector-
space systems will be biased against long docu-
ments even if they embed the answer in a narrow
zone. This problem can be ameliorated by repre-
senting suitably-sized passage windows (rather than
whole documents) as vectors. While scoring pas-
sages using the cosine measure, we can also ignore
passage terms which do not occur in the query.
The second issue is one of proximity. A passage
is likely to be promising if query words occur close
to one another. Commercial search engines reward
proximity of matched query terms, but in undocu-
mented ways. Clarke et al. (Clarke et al., 2001) ex-
ploit term proximity within documents for passage
scoring.
The third and most important limitation of stock
IR systems is the inability to bridge the lexical
chasm between question and potential answer via
lexical networks. One query from TREC (Vorhees,
2000) asks, “Who painted Olympia?” The answer
is in the passage: “Manet, who, after all, created
Olympia, gets no credit.”
QA systems use a gamut of techniques to deal
with this problem. FALCON (Harabagiu et al.,
2000) (one of the best QA systems in recent TREC
competitions) integrates syntactic, semantic and
pragmatic knowledge for QA. It uses WordNet-
based query expansion to try to bridge the lexical
chasm. WordNet is customized into an answer-type
taxonomy to infer the expected answer type for a
question. Named-entity recognition techniques are
also employed to improve quality of passages re-
trieved. The answers are finally filtered by justifying
them using abductive reasoning. Mulder (Kwok et
al., 2001) uses a similar approach to perform QA on
Web scale. The well-known START system (Katz,
1997) goes even further in this direction.
Discussion: In general, the TREC QA systems di-
vide QA into two tasks: identifying relevant doc-
uments and extracting answer passages from them.
For the former task, most systems use traditional IR
engines coupled with ad-hoc query expansion based
on WordNet. Handcrafted knowledge bases, ques-
tion/answer type classifiers and a variety of heuris-
tics are used for the latter task. Success in QA
comes at the cost of great effort in custom-designed
wordnets and ontologies, and expansion, matching
and scoring heuristics which need to be upgraded
as the knowledge bases are enhanced. Ideally, we
should use a knowledge base which can be readily
extended, and a core scoring algorithm which is ele-
gant and “universal”.
3 Proposed approach
3.1 An inferencing approach to QA
Given a question and a passage that contains the an-
swer, how do we correlate the two? Take for exam-
ple, the following question
What type of animal is Winnie the Pooh?
and the answer passage is
A Canadian town that claims to be the birthplace
of Winnie the Pooh wants to erect a giant statue of
the famous bear; but Walt Disney Studios will not
permit it.
It is clear that there is a linkage between the ques-
tion word animal and the answer word bear. That
the word bear occurred in the answer, in the context
of Winnie, means that there was a hidden ”cause”
for the occurrence of bear, and that was the concept
of a6 animala7 .
In general, there could be multiple words in the
question and answer that are connected by many hid-
den causes. This scenario is depicted in figure 1.
The causes themselves may have hidden causes as-
sociated with them.
Figure 1: Motivation. Observed word nodes in the question and answer are connected through hidden concept nodes (causes), some of which are switched on and some switched off.
These causal relationships are represented in on-
tologies and WordNets. The familiar English Word-
Net, in particular, encodes relations between words
and concepts. For instance, WordNet gives the hy-
pernymy relation between the concepts {animal}
and {bear}.
3.2 WordNet
WordNet (Fellbaum, 1998b) is an online lexical ref-
erence system in which English nouns, verbs, ad-
jectives and adverbs are organized into synonym
sets or synsets, each representing one underly-
ing lexical concept. Noun synsets are related to
each other through hypernymy (generalization), hy-
ponymy (specialization), holonymy (whole of) and
meronymy (part of) relations. Of these, (hypernymy,
hyponymy) and (meronymy,holonymy) are comple-
mentary pairs.
The verb and adjective synsets are very sparsely
connected with each other. No relation is available
between noun and verb synsets. However, 4500 ad-
jective synsets are related to noun synsets with per-
tainyms (pertaining to) and attribute (attributed with) re-
lations.
Figure 2: Illustration of WordNet relations: a hyponymy link from {dog, domestic dog, Canis familiaris} to {corgi, Welsh corgi}, and a meronymy link to {flag} (from {Canis, genus Canis}).
Figure 2 shows that the synset {dog, domes-
tic dog, canis familiaris} has a hyponymy link to
{corgi, welsh corgi} and a meronymy link to {flag}
(“a conspicuously marked or shaped tail”). While
the hyponymy link helps us answer the question
(TREC#371) “A corgi is a kind of what?”, the
meronymy connection here is perhaps more confus-
ing than useful: this sense of flag is rare.
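The same relations can be inspected programmatically. The sketch below uses NLTK's WordNet interface (assuming the WordNet corpus data is installed); sense identifiers such as 'dog.n.01' are illustrative and may differ across WordNet versions.

```python
# Quick look at the Figure 2 relations via NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')     # {dog, domestic dog, Canis familiaris}
print(dog.hyponyms())           # expect the {corgi, Welsh corgi} synset in this list
print(dog.part_meronyms())      # expect the rare {flag} ("tail") synset
print(dog.hypernyms())          # first step of the chain up towards {animal}
```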
3.3 Inferencing on lexical relations
It is surprisingly difficult to make the simple idea
of bridging passage to query through lexical net-
works perform well in practice. Continuing the ex-
ample of Winnie the bear (section §3.1), the En-
glish WordNet has five synsets on the path from bear
to animal: {carnivore...}, {placental mammal...},
{mammal...}, {vertebrate...}, {chordate...}.
Some of these intervening synsets would be ex-
tremely unlikely to be associated with a corpus that
is not about zoology; a common person would more
naturally think of a bear as a kind of animal, skip-
ping through the intervening nodes.
It is, however, dangerous to design an algorithm
which is generally eager to skip across links in a lex-
ical network. E.g., few QA applications are expected
to need an expansion of “bottle” beyond “vessel”
and “container” to “instrumentality” and beyond.
Another example would be the shallow verb hierar-
chy in the English WordNet, with completely dis-
similar verbs within very few links of each other.
There is also the problem of missing links.
Another important issue is which ‘hidden causes’
(synsets) should be inferred to have caused words
in the text. This is a classical problem called
word sense disambiguation (WSD). For instance,
the word dog belongs to 6 noun synsets in Word-
Net. Which of the 6 synsets should be treated as the
‘hidden cause’ that generated the word dog in the
passage can be inferred from the fact that collie is
related to dog only through one of the latter’s senses,
namely {dog, domestic dog, Canis familiaris}.
But this problem of finding the ‘appropriate’ hidden
causes is, in general, non-trivial. Given that state-of-
the-art WSD systems perform no better than 74%
(Sanderson, 1994; Lewis and Jones, 1996; Fell-
baum, 1998b), in this paper we use a probabilistic
approach to WSD, called ‘soft WSD’ (Ramakrishnan
and Bhattacharyya, 2003): hidden nodes are considered
to have probabilistically ‘caused’ words in the question
and answer; in other words, causes are probabilistically
‘switched on’.
Clearly, any scoring algorithm that seeks to uti-
lize WordNet link information must also discrimi-
nate between them based (at least) on usage statis-
tics of the connected synsets. Also required is an
estimate of the likelihood of instantiating a synset
into a token because it was “activated” by a closely
related synset. We find a Bayesian belief network
(BBN) a natural structure to encode such combined
knowledge from WordNet and corpus.
3.4 Bayesian Belief Network
A Bayesian network (Heckerman, 1995) for a set of
random variables $X = \{X_1, X_2, \ldots, X_n\}$ consists
of a directed acyclic graph (DAG) that encodes a set
of conditional independence assertions about vari-
ables in $X$ and a set of local probability distributions
associated with each variable. Let $\mathrm{Pa}_i$ denote the set
of immediate parents of $X_i$ in the DAG, and $\mathrm{pa}_i$ a
specific instantiation of these random variables.
The BBN encodes the joint distribution $\Pr(X_1, X_2, \ldots, X_n)$ as
$$\Pr(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{pa}_i) \qquad (1)$$
Each node in the DAG encodes $\Pr(X_i \mid \mathrm{pa}_i)$ as a
“conditional probability table” (CPT). Figure 3
shows a Bayesian belief network interpretation for
a part of WordNet. The synset {corgi, welsh corgi}
has a causal relation from {dog, domestic dog, ca-
nis familiaris}. A possible conditional probability
table for the network is shown to the right of the
structure.
Figure 3: Causal relations between two synsets. Parent: {dog, domestic dog, canis familiaris}; child: {corgi, welsh corgi}. CPT for the child: Pr(child present | parent present) = 0.9, Pr(child absent | parent present) = 0.1, Pr(child present | parent absent) = 0.01, Pr(child absent | parent absent) = 0.99.
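As a tiny worked example of equation (1) with the CPT of Figure 3, the sketch below enumerates the two-node joint distribution; the parent prior of 0.2 is an assumed value, not a number from the paper.

```python
# A two-node worked example of Eq. (1) using the CPT of Figure 3.
p_parent = {1: 0.2, 0: 0.8}                      # Pr(parent synset present) -- assumed prior
p_child = {1: {1: 0.9, 0: 0.1},                  # Pr(child | parent present)
           0: {1: 0.01, 0: 0.99}}                # Pr(child | parent absent)

# Factorization: Pr(parent, child) = Pr(parent) * Pr(child | parent)
joint = {(pa, ch): p_parent[pa] * p_child[pa][ch]
         for pa in (0, 1) for ch in (0, 1)}

# Marginal probability that the child synset is 'present'
print(sum(joint[(pa, 1)] for pa in (0, 1)))      # 0.2*0.9 + 0.8*0.01 = 0.188
```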
The idea of constructing a BBN from WordNet has
been proposed by (Wiebe et al., 1998). But that idea is
centered around hard sense disambiguation, i.e.,
finding the ‘correct’ sense of each word in the text.
In this paper, we particularly explore the idea of
soft sense disambiguation, i.e., synsets are
probabilistically considered to be causes of their
constituent words. Moreover, WSD is not an end in
itself. The goal is to connect words within the ques-
tion, within the answer passage, and across the two;
WSD is only a by-product.
Our goal is to build a QA system which imple-
ments a clear division of labor between the knowl-
edge base and the scoring algorithm, codifies the
knowledge base in a uniform manner, and thereby
enables a generic algorithm and a shared, extensible
knowledge base. Based on the discussion above, our
knowledge representation must be probabilistic, and
our system must combine and be robust to multiple,
noisy sources of information from query and answer
terms.
Moreover, we would like to be able to learn im-
portant properties of our knowledge base from con-
tinual training of our system with corpus samples
as well as samples of successful and unsuccessful
(question, answer) pairs. In essence, we would like
to automate as far as possible, the customization of
lexical networks to QA tasks. Given the English
WordNet, it should be possible to reconstruct our al-
gorithm completely from this paper.
Toward these ends, we describe how to induce
a Bayesian Belief Network (BBN) from a lexical
network of relations. Specifically, we propose a
semi-supervised learning mechanism which simul-
taneously trains the BBN and associates text tokens
(words) with synsets in WordNet in a
probabilistic manner (“soft WSD”). Finally, we use
the trained BBN to score passages in response to a
question.
3.5 Building a BBN from WordNet
Our model of the BBN is that each synset from
WordNet is a boolean event associated with a ques-
tion, a passage, or both. Textual tokens are also
events. Each event is a node in the BBN. Events can
cause other events to happen in a probabilistic man-
ner, which is encoded in CPTs. The specific form
of CPT we use is the well-known noisy-OR of Pearl
(Pearl, 1988).
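For reference, a minimal sketch of a noisy-OR CPT entry is given below; the per-link probabilities and the optional leak term are illustrative placeholders, not parameters from the trained system.

```python
# Minimal sketch of a noisy-OR CPT entry (Pearl, 1988): the child is 'absent'
# only if every active parent independently fails to trigger it.
def noisy_or(active_link_probs, leak=0.0):
    """Pr(child = present | the listed parents are present)."""
    p_absent = 1.0 - leak
    for p in active_link_probs:        # p = Pr(this parent alone causes the child)
        p_absent *= 1.0 - p
    return 1.0 - p_absent

print(noisy_or([0.9]))                 # single active parent -> 0.9
print(noisy_or([0.9, 0.5]))            # two active parents   -> 0.95
```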
We introduce a node in the BBN for each noun,
verb, and adjective synset in WordNet. We also in-
troduce a node for each (non-stop-word) token in the
corpus and all questions. Hyponymy, meronymy,
and attribute links are introduced from WordNet.
Sense links are used to attach tokens to potentially
matching synsets. E.g., the string “flag” may be at-
tached to the synset nodes {sag, droop, swag, flag} and
{a conspicuously marked or shaped tail}. (The pur-
pose of probabilistic disambiguation is to estimate
the probability that the string “flag” was caused by
each connected synset node.)
This process creates a hierarchy in which the
parent-child relationship is defined by the semantic
relations in WordNet: a node u is a parent of a node v
iff u is the hypernym, holonym, or attribute-of v, or u
is a synset containing the word v. The process by which the
Bayesian network is built from the WordNet hyper-
graph of synsets and from the mapping between
words and synsets is depicted in figure 4. We define
going up the hierarchy as the traversal from child to
parent.
Ideally, we should update the entire BBN and its
CPTs while scanning over the training corpus. In
practice, BBN training and inference are CPU- and
memory-intensive processes.
We compromise by first attaching the token nodes
Figure 4: Building a BBN from WordNet and associated text tokens: the WordNet hypergraph plus the word-synset maps (words added as children of their synsets) yield the Bayesian belief network, with a conditional probability table for each node.
to their synsets and then walking up the WordNet
hierarchy up to a maximum height decided purely
by CPU and memory limitations. We believe that
the probabilistic influence from distant nodes is too
feeble and unreliable to warrant modeling.
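A minimal sketch of this construction, assuming NLTK's WordNet interface and omitting attribute links and CPT initialization for brevity, might look as follows; it is not the authors' implementation.

```python
# Sketch: attach each token to its candidate synsets (sense links), then walk
# hypernym/holonym links upwards to a limited height.  Assumes NLTK's WordNet.
from nltk.corpus import wordnet as wn

def build_structure(tokens, max_height=4):
    parents = {}                                   # node name -> set of parent node names
    frontier = set()
    for tok in tokens:                             # sense links: token -> candidate synsets
        for syn in wn.synsets(tok):
            parents.setdefault(tok, set()).add(syn.name())
            frontier.add(syn)
    for _ in range(max_height):                    # limited upward walk through WordNet
        nxt = set()
        for syn in frontier:
            ups = syn.hypernyms() + syn.member_holonyms() + syn.part_holonyms()
            for up in ups:
                parents.setdefault(syn.name(), set()).add(up.name())
                nxt.add(up)
        frontier = nxt
    return parents

structure = build_structure(['corgi'], max_height=2)
print(structure['corgi'])          # candidate synsets for the token "corgi"
print(sorted(structure))           # the token plus the synset nodes reached so far
```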
4 Our QA system
The overall question answering system that we pro-
pose is depicted in figure 5. The corresponding al-
gorithm is outlined in figure 6.
Figure 5: The overall QA system: the question drives TFIDF retrieval of 50 documents; passage extraction produces N-word windows; a Bayesian network, trained offline on the corpus, ranks the passages p1, p2, ..., pn.
The question triggers the TFIDF retrieval mod-
ule to pick up the 50 most relevant documents. These
documents are subjected to a sliding window to pro-
duce overlapping passages of a fixed word length. The Bayesian
belief network described in section 3.5 ranks these
passages; the top-ranked passage is expected to
contain the answer. The belief network parameters
are the CPTs, which are initialized as noisy-OR CPTs.
The Bayesian belief network is trained offline using
1: Construct a Bayesian Network structure using the Word-
Net structure
2: Train the Bayesian network parameters on the corpus
containing the answers
3: Do question answering with trained Bayesian Network
Figure 6: The over-all question answering algorithm
1: while CPTs do not converge do
2: for each sliding window of words in the text do
3: Clamp the word nodes in the Bayesian Network to a
state of ‘present’
4: for each node in Bayesian network do
5: find its joint probabilities with all configurations
of its parent nodes (E Step)
6: end for
7: end for
8: Update the conditional probability tables for all ran-
dom variables (M Step)
9: end while
Figure 7: Training the Bayesian Network for a corpus
the Expectation Maximization algorithm (Dempster,
1977) on windows sliding over the whole corpus.
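A toy sketch of the sliding-window step is shown below; the window length and stride are assumed values, since the paper only specifies fixed-length word windows, used both for passage extraction and as training instances.

```python
# Toy sketch of fixed-length sliding word windows over a tokenized document.
def extract_windows(doc_tokens, window=20, stride=10):
    """Return overlapping fixed-length word windows over a token list."""
    windows = []
    for start in range(0, max(1, len(doc_tokens) - window + 1), stride):
        windows.append(doc_tokens[start:start + window])
    return windows

doc = ("a canadian town that claims to be the birthplace of "
       "winnie the pooh wants to erect a giant statue").split()
print(extract_windows(doc, window=6, stride=3))
```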
4.1 Training the belief network
Figure 7 describes the algorithm for training the
BBN obtained from WordNet. We initialize the
CPTs as noisy-OR. The instances we use for train-
ing are fixed-length word windows from the cor-
pus. Since the corpus is normally not tagged with
WordNet senses, all variables other than the words
observed in the window (i.e., the synset nodes in
the BBN) are hidden or unobserved. Hence we use
the Expectation Maximization algorithm (Dempster,
1977) for parameter learning. For each instance,
we find the expected values of the hidden variables,
given the present state of each of the observed vari-
ables. These expected values are used after each
pass through the corpus to update the CPT for each
node. Passes through the corpus are repeated
until the sum of squared Kullback-Leibler di-
vergences between CPTs in successive iterations
falls below a threshold, i.e., until the convergence
criterion is met. Figure 7 outlines this training
procedure. In essence, we customize the
Bayesian network CPTs to a particular corpus by
learning the local CPTs.
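To convey the flavor of the E and M steps for soft WSD, the deliberately simplified sketch below runs EM for a mixture over a token's candidate synsets with fixed per-occurrence compatibility scores. The full system instead updates noisy-OR CPTs over the whole network, so this is an approximation; the synset names and scores are illustrative.

```python
# Simplified EM sketch for "soft WSD": each occurrence of an ambiguous token is
# treated as generated by exactly one candidate synset (a mixture model).
from collections import defaultdict

def em_synset_priors(instances, n_iters=20):
    """instances: one list per token occurrence, of (synset, compatibility) pairs.
    Returns a shared prior Pr(synset) learned by EM; the E-step responsibilities
    are the per-occurrence 'soft WSD' posteriors."""
    synsets = {s for inst in instances for s, _ in inst}
    prior = {s: 1.0 / len(synsets) for s in synsets}
    for _ in range(n_iters):
        counts = defaultdict(float)
        for inst in instances:                        # E-step: responsibilities
            z = sum(prior[s] * score for s, score in inst)
            for s, score in inst:
                counts[s] += prior[s] * score / z
        total = sum(counts.values())                  # M-step: re-estimate the prior
        prior = {s: counts[s] / total for s in synsets}
    return prior

# Two occurrences of "dog"; scores reflect how well each sense fits its context.
occurrences = [[('dog.n.01', 0.9), ('frump.n.01', 0.1)],
               [('dog.n.01', 0.6), ('frump.n.01', 0.4)]]
print(em_synset_priors(occurrences))
```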
4.2 Ranking answer passages
Given a question, we rank the passages by the
joint probability of the question words, given the
candidate answer passage. Every question or answer can
be looked upon as an event in which its word
nodes are switched to the state ‘present’. There-
fore, if $p_1, p_2, \ldots, p_n$ are passages and $q$ is the ques-
tion, the answer is that passage $p_i$ which maximizes
$\Pr(q \mid p_i)$ over all passages $p_i$ deemed as candidate an-
swers. $\Pr(q \mid p_i)$ is the joint probability of the words
of $q$, each being in state ‘present’ in the Bayesian
network, given that all the word nodes for $p_i$ are
clamped to the state ‘present’ in the belief network.
1: Load the Bayesian network parameters
2: for each question q do
3: for each candidate passage p do
4: clamp the variables (nodes) corresponding to the
passage words in the network to a state of ‘present’
5: Find the joint probability of all question words being
in state ‘present’, i.e., $\Pr(q \mid p)$
6: end for
7: end for
8: Report the passages in decreasing order of $\Pr(q \mid p)$
Figure 8: Ranking answer passages for a given question
Figure 8 outlines the actual passage ranking algo-
rithm.
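The sketch below illustrates this ranking computation on a toy four-node network by brute-force enumeration: the passage word node is clamped to 'present' and the joint probability of the question word being 'present' is computed. The network, node names, and CPT values are invented for illustration; the real system uses the trained WordNet-derived BBN and noisy-OR CPTs.

```python
# Toy exact inference for the ranking step: Pr(question word present | passage
# word clamped 'present'), by enumerating the hidden synset nodes.
from itertools import product

# node -> (parents, CPT); the CPT maps a tuple of parent states to Pr(node = 1)
NET = {
    'animal.syn':  ((),              {(): 0.05}),
    'bear.syn':    (('animal.syn',), {(1,): 0.6,  (0,): 0.01}),
    'animal.word': (('animal.syn',), {(1,): 0.7,  (0,): 0.001}),
    'bear.word':   (('bear.syn',),   {(1,): 0.8,  (0,): 0.001}),
}

def joint(assignment):
    """Probability of one complete 0/1 assignment under Eq. (1)."""
    p = 1.0
    for node, (parents, cpt) in NET.items():
        p1 = cpt[tuple(assignment[u] for u in parents)]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

def prob(query, evidence):
    """Pr(query | evidence) by brute-force enumeration of the free nodes."""
    def marginal(fixed):
        free = [n for n in NET if n not in fixed]
        total = 0.0
        for states in product((0, 1), repeat=len(free)):
            a = dict(zip(free, states))
            a.update(fixed)
            total += joint(a)
        return total
    return marginal({**query, **evidence}) / marginal(evidence)

# Clamp the passage word "bear" to 'present' and score the question word "animal".
print(prob({'animal.word': 1}, {'bear.word': 1}))
```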
The reason for choosing $\Pr(q \mid p_i)$ over $\Pr(p_i \mid q)$
is that (a) $q$ typically contains very few words;
$\Pr(p_i \mid q)$, therefore, may not help in bridging the re-
lations between answer words. (b) The passage would
be penalized if it contains many words which are not
present in the question and are also not closely re-
lated to the question words through WordNet.
This could happen even if the passage
contains a few words which are all present in the
question and/or are semantically closely related to
the question, in addition to containing the answer
to the question. Also, (c) if the passages $p_i$ are of
varying lengths, the $\Pr(q \mid p_i)$ values are brought to the same
scale, namely that of the question words, which are fixed across
passages/snippets, whereas $\Pr(p_i \mid q)$ can be affected
and penalized by long snippets.
In fact, our apprehensions about using $\Pr(p_i \mid q)$
are justified in the experimental section: the
QA performance obtained using $\Pr(p_i \mid q)$ is drasti-
cally poorer; in fact, it is worse than the baseline
QA algorithm.
Dealing with non-WordNet words: Suppose
there is a word $w$ in the question which is not in
WordNet. As for the answer passages, we could
have ignored such words. But the question may be
seeking an answer to precisely such a word. Also,
since the question has very few words, no word in
the question should be ignored. We
deal with this situation in the following way. We
call a word a connecting word if it is the key word
that links the passage to the question. Note that for
WordNet words, the connecting nodes are Word-
Net concepts. In the case of non-WordNet words,
we do not have any hidden, connecting nodes, so we
consider the words themselves to be possible con-
nections.
Let $\mathit{connect}_w$ be a random variable which takes
the state ‘present’ if $w$ is a connecting word between
the question and the answer, and ‘absent’ if
it is not. Let $wQ$ and $wP$ be random
variables that are ‘present’ if $w$ occurs in the ques-
tion or passage respectively, and ‘absent’ otherwise.
By Bayes' rule, the probability that
the word $w$ occurs in the question, given that it oc-
curs in the answer (1 = present, 0 = absent), is
$$\Pr(wQ{=}1 \mid wP{=}1) = \frac{\sum_{c \in \{0,1\}} \Pr(wQ{=}1 \mid \mathit{connect}_w{=}c)\, \Pr(wP{=}1 \mid \mathit{connect}_w{=}c)\, \Pr(\mathit{connect}_w{=}c)}{\sum_{c \in \{0,1\}} \Pr(wP{=}1 \mid \mathit{connect}_w{=}c)\, \Pr(\mathit{connect}_w{=}c)}$$
where $\Pr(\mathit{connect}_w{=}1)$, $\Pr(wQ{=}1 \mid \mathit{connect}_w{=}1)$,
$\Pr(wP{=}1 \mid \mathit{connect}_w{=}1)$, and their complements are
estimated from question-answer pairs. Moreover, the
occurrence of non-WordNet words is assumed to be
independent of each other and also of the occurrence
of WordNet words.
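A small sketch of this marginalization over the hidden connecting-word indicator is given below; the five probabilities passed in are placeholder values, whereas the paper estimates them (and their complements) from question-answer pairs.

```python
# Sketch of the non-WordNet-word treatment: marginalize over the hidden
# connecting-word indicator, assuming wQ and wP are independent given it.
def p_wq_given_wp(p_conn, p_wq_conn, p_wp_conn, p_wq_noconn, p_wp_noconn):
    """Pr(wQ = 1 | wP = 1) under the conditional-independence assumption."""
    num = (p_wq_conn * p_wp_conn * p_conn
           + p_wq_noconn * p_wp_noconn * (1.0 - p_conn))
    den = p_wp_conn * p_conn + p_wp_noconn * (1.0 - p_conn)
    return num / den

print(p_wq_given_wp(p_conn=0.3, p_wq_conn=0.8, p_wp_conn=0.7,
                    p_wq_noconn=0.05, p_wp_noconn=0.1))
```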
5 Experiments and results
We perform extensive experiments to evaluate our
system, using the TREC QA benchmark
(http://trec.nist.gov/data/qa.html). We find
that our algorithm is a substantial improvement
over a baseline IR approach to passage ranking.
Based on published numbers, it also appears to be
in the same league as the top performers at recent
TREC QA events. We also note that training our
system improves the quality of our ranking, even
though WSD accuracy does not increase, which af-
firms the belief that passage scoring need not depend
on perfect WSD, given a robust ‘soft WSD’
(see section §3.3).
5.1 Experimental setup
We use the Text REtrieval Conference (TREC)
(Vorhees, 2000) corpus and question/answers from
its QA track. The corpus is 2 GB of newspaper arti-
cles. There is a set $Q$ of about 690 factual questions.
For each question, we retrieve the top 50 documents
using a standard TFIDF-based IR engine such as
SMART. We used the question set and correspond-
ing top-50 document collections from TREC 2001 for
our experiments. We used MXPOST (Ratnaparkhi,
1996), a maximum entropy based POS tagger. The
part of speech tag is used while mapping document
and question terms to their corresponding nodes in
the BBN.
The passage length we chose was 20 words.
Unless otherwise stated explicitly, the maximum
height up to which the BBN was used for inferenc-
ing for each question-passage pair can be assumed
to be 4.
5.2 Evaluation
TREC QA evaluation has two runs based on the
length of the system response to a question. In the first,
the response is a passage of up to 250 bytes in size.
The second, more ambitious run asks for shorter re-
sponses of up to 50 bytes. (More recently, TREC has
updated its requirements to demand exact, extracted
answers.)
To determine if the response is actually an answer
to the question, TREC provides a set of regular ex-
pressions for each question. The presence of any of
these in the response indicates that it is a valid an-
swer. For evaluation, the system is required to sub-
mit its top five responses for each question. This
is used to calculate the performance measure mean
reciprocal rank (MRR) for the system, defined as
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_{\mathrm{first}}(q)} \qquad (2)$$
Here $r_{\mathrm{first}}(q)$ is the first rank at which a correct answer
occurs for question $q \in Q$. If for a question $q$ the
correct answer is not in the top 5 responses, then
$1/r_{\mathrm{first}}(q)$ is taken to be zero.
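Equation (2) translates directly into a few lines of code; the sketch below assumes a mapping from each question to the rank of its first correct response (None when no correct answer appears in the top five).

```python
# Direct implementation of Eq. (2).
def mean_reciprocal_rank(first_rank):
    """first_rank: question id -> rank of first correct response, or None."""
    total = sum(1.0 / r if r is not None else 0.0 for r in first_rank.values())
    return total / len(first_rank)

print(mean_reciprocal_rank({'q1': 1, 'q2': 3, 'q3': None}))   # (1 + 1/3 + 0) / 3
```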
5.3 Results
IR baseline: IR technology is widely accessible,
and forms our baseline. We construct 250-byte win-
dows of text as passages and compute the similarity
between these passages and the query. Because we
would not like to penalize passages for having terms
not in the question (provided they have at least some
query terms), we use an asymmetric TFIDF similar-
ity. Under this measure, the score of a passage is the
sum of the IDFs of the question terms contained in
the passage. If $D$ is the document collection and $D_t$
is the set of documents containing $t$, then one com-
mon form of IDF weighting (used by SMART) is
$$\mathrm{IDF}(t) = \log\!\left(1 + \frac{|D|}{|D_t|}\right) \qquad (3)$$
The IR baseline MRR is only about 0.3, which is
far short of Falcon, which has an MRR of almost 0.7.
The baseline MRR is low for the obvious reasons:
the IR engine cannot bridge the lexical gap.
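A compact sketch of this asymmetric scoring, with a toy three-document collection, is given below; the helper names and data are ours, not SMART's.

```python
# Sketch of the asymmetric baseline of Eq. (3): a passage scores the sum of the
# IDFs of the question terms it contains; passage-only terms are ignored.
import math

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(1 + len(docs) / df) if df else 0.0

def asymmetric_score(question_terms, passage_terms, docs):
    return sum(idf(t, docs) for t in set(question_terms) if t in passage_terms)

docs = [{'manet', 'olympia', 'painted'}, {'corgi', 'dog'}, {'olympia', 'statue'}]
print(asymmetric_score({'painted', 'olympia'}, {'manet', 'created', 'olympia'}, docs))
```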
System MRR
Asymmetric TFIDF 0.314
Untrained BBN 0.429
Trained BBN 0.467
Table 1: MRRs for baseline, untrained and trained BBNs
System MRR
FALCON 0.76
University of Waterloo 0.46
Queens College, CUNY 0.46
Table 2: MRRs for best performing systems in TREC9
Base BBN: Initialized with our default parame-
ters, our BBN-based approach achieves an MRR of
0.429, which is already a significant step up from the
IR baseline. A large component of this improvement
is caused by conflating different strings to common
synsets.
Trained BBN: We recalibrated our system after
training the BBN with the corpus. This resulted in
a visible improvement in our MRR, from 0.429 to
0.467, which takes us into the same league as the
systems from University of Waterloo and Queens
College, reported at TREC QA.
Tables 1 and 2 summarize our MRR results and
juxtapose them with the published MRRs for some
of the best-performing QA systems in TREC 2000.
Given that we have invested zero customization ef-
fort in WordNet, it is impressive that our MRR com-
pares favorably with all but the best system.
Experiments for varying heights of BBN: The
MRR obtained went down to 0.342 when the height
of the traced BBN was restricted to 1, i.e., only words
and their immediate synsets were considered. It is
significant that even with immediate-synset
expansion alone, there is a marginal improvement over
asymmetric TFIDF. The MRR improved to 0.421 and
0.450 when the height was increased to 2 and 3 re-
spectively. These results are tabulated in table 3.
Experiments restricting to WordNets of dif-
ferent parts of speech: The MRR found by us-
ing only the noun WordNet was 0.415. Words
in the remaining parts of speech were treated as
Height MRR
1 0.342
2 0.421
3 0.450
4 0.467
Table 3: MRRs for BBNs truncated at different heights
WordNet for diff POS MRR
Noun 0.415
Adjective 0.340
Verb 0.32
Noun+Adjective 0.442
Noun+Verb 0.393
Verb+Adjective 0.332
Noun+Verb+Adjective 0.467
Table 4: MRRs for BBNs restricted to diff parts of WordNet
Expt setup MRR
Pr(question|answer) with only WordNet words 0.370
No Bayesian inferencing 0.30
Pr(passage|question) 0.021
Table 5: MRRs for other experiments
non-WordNet words in this experiment. The MRR
dropped to 0.340 when only the adjective WordNet
was used. The MRR found using only the verb
WordNet was a low 0.32. This is because the verb
WordNet is very shallow and many semantically dis-
tant verbs are connected closely together. The MRR
score obtained by considering the noun+adjective part
of WordNet was 0.442, that obtained by considering
the noun+verb part of WordNet was 0.393, and that ob-
tained by considering the verb+adjective part of Word-
Net was 0.332. These results are summarized in ta-
ble 4. The results seem to justify the observation
that the verb WordNet in its current form is shallow
in height and has a high in/out degree for each node;
this is mainly due to the high ambiguity of verbs.
But coupled with the noun and adjective WordNets, the
verb WordNet improves overall performance.
Miscellaneous experiments: The MRR obtained
by considering only WordNet words was 0.370,
which indicates that we cannot afford to ignore the
non-WordNet words. It also seems that induc-
ing ‘semantic similarity’ between words not in the
WordNet vocabulary is not really required. By
skipping Bayesian inferencing altogether, we get an
MRR of 0.30, which matches the asymmetric
TFIDF baseline (about 0.3) mentioned earlier. The MRR
drastically fell to 0.021 when $\Pr(\mathit{passage} \mid \mathit{question})$
was used to rank the passages. This partly justifies the
apprehension, expressed earlier, about ranking by the
probability of the passage given the question: passages
get penalized if they contain many words which are
either not in the question or not related
to words in the question. These results are summa-
rized in table 5.
The effect of WSD: It is interesting to note that
training does not substantially affect disambiguation
accuracy (which stays at about 75%), and MRR im-
proves despite this fact. This seems to indicate that
learning joint distributions between query and can-
didate answer keywords (via synset nodes, which are
“bottleneck” variables in BBN parlance) is as impor-
tant for QA as is WSD. Furthermore, we conjecture
that “soft” WSD is key to maintaining QA MRR in
the face of modest WSD accuracy.
5.4 Analysis
In the following, we analyse how Bayesian inferenc-
ing on lexical relations contributes towards ranking
passages.
How joint probability helps: To find the prob-
ability of the question given a passage, we take the joint
probability of the question words, conditioned on
the (evidence of the) answer words. Thus we attempt
to overcome the usual bottleneck of the word-independence
assumption made in the naive Bayes model.
The relations of question words among themselves
and with words in the answer are precisely what
give a joint probability that differs from a
naive product of marginals. This is illustrated
in section 5.5.
How parameter smoothing helps: If a question
word does not occur in the answer, the marginal
probability of that word should be high if it strongly
relates to one or more words in the answer through
WordNet. Without using WordNet, one could re-
sort to finding this marginal probability from a cor-
pus. Such corpus probabilities are remarkably low even
for words that are semantically very close to words
in the answer, as illustrated in the case
studies in section 5.5. This problem can be at-
tributed to data sparsity.
5.5 Case studies
Case 1: This example shows that the passage in
figure 10 contains the correct answer to the ques-
tion in figure 9 and was given rank 1. The interest-
ing observation is that the words kind and type are
related correctly through WordNet, giving a high
marginal probability to the word kind (0.557435) in
the question, even though it does not occur in the
answer. This is depicted in figure 12.
The marginal probability of the same word (given
that it is absent in the answer passage), as deter-
mined by corpus statistics, is 0.00020202, which is
very small. This illustrates the advantage of param-
eter smoothing.
TREC Question ID 371: A corgi is a kind of
what?
Figure 9: Sample question Q1
Bayesian Marginal Probs: corgi: 1.000000,
kind: 0.557435 ....corgis: They are of course
collie-type dogs originally bred for cattle herding.
As such they will chase anything particularly an-
kles....
Figure 10: Answer for Q1, Rank 1, Score(Joint Probability) =
0.893133, (Document ID:AP881106-0015)
Bayesian Marginal Probs: corgi:1.000000,
kind:0.006421 ....current favorite. So are bull-
dogs. Jack Russell terriers are popular with the
horsy set. “The short-legged welsh corgi is big
(Queen Elizabeth II has at least one). And
so, of course, is the english bull terrier (thanks to
Anheuser-Busch, Bud Light and Spuds. MacKen-
zie). Barbara.....
Figure 11: Non-answer for Q1, Rank 2, Score(Joint Probability)
= 0.647734, (Document ID:WSJ900423-0005)
Figure 12: Relation between kind in the question and type in the answer. The question word kind and the answer word type are linked through synsets such as {kind, sort, form, variety}, {category}, and {character, type}; the other senses of type ({type (biological)}, {type (subdivision)}, {type (metal block)}, {symbol}, {taxonomic group, taxon}) are also shown.
Additionally, the joint probability of ques-
tion words given the passage words of fig-
ure 10 (0.893133) is not the product of their
marginals ($\Pr(\mathit{corgi} \mid \mathit{passage}) = 1.000000$,
$\Pr(\mathit{kind} \mid \mathit{passage}) = 0.557435$). The reason for
this is that the word dog, which occurs in the answer
passage, is related to the word corgi in the question
through WordNet, as shown in figure 13. It can be
seen easily that these lexical relations increase the
joint probability of the question words, given the an-
swer words, over the product of the marginals of the
individual words.
In contrast, the passage of figure 11, which
contains no answer to the question, also contains
no word which is closely related to the word
kind through WordNet. Therefore, the marginal
probabilities as well as the joint probability of the
same question words given this passage are low
compared to the passage of figure 10. As a result
the second passage gets a low rank.
Figure 13: Relation between dog in the answer and corgi in the question. The answer word dog and the question word corgi are linked through the synsets {dog, domestic dog, Canis familiaris} (from {Canis, genus Canis}) and {corgi, Welsh corgi}; the other senses of dog ({frump, dog}, {dog, bounder}, {andiron, dog}, {pawl, dog}, {dog, man}) are also shown.
Case 2: The passage in figure 15 was ranked highest
for the question in figure 14, even though it
does not contain the answer central america. This
is because all question words occur in the passage,
and therefore the passage gets rank 1. This
highlights a limitation of our mechanism. On the
other hand, the passage ranked 2nd contains the an-
swer. It gets a joint probability score of 0.890192,
even though the word belize does not occur in the
answer. This is because belize is connected to the
words central america and country through
WordNet. The passage shown in figure 17, which
does not contain the answer, got a low rank
of 10 because it induced a low joint probability of
0.033451 on the question even though the word be-
lize was present in the passage: locate was
absent from the passage and is not immediately con-
nected to other words in the passage. This again il-
lustrates the advantage of using Bayesian inferenc-
ing on lexical relations.
Case 3: Here we present an example to illustrate
how the mechanism can go wrong due to the
absence of links. The passage in figure 19 induces
a conditional joint probability of 1 on the question
in figure 18, because the passage contains all the
words present in the question. The passage, however,
does not answer the question. On the other hand,
the passage shown in figure 20 contains the answer,
but induces a lower joint probability on the question,
because the verb stand for is not closely related
through WordNet to any of the words in the pas-
sage. In fact, one would have expected stands for
and stand for to be related to each other through
TREC Question ID : 202 Where is Belize lo-
cated ?
Figure 14: Sample question Q2
Bayesian Marginal Probs: belize: 1.000000, lo-
cate: 1.000000 ....settlers has been confirmed to
the east of the historic monuments that are being
used as a reference point with Belize . She pointed
out that in case they prove the settlement is located
in the protected Mayan biosphere area and that it
was established illegally , the settlers will have to
leave the area , but the.....
Figure 15: Non-Answer to Q2, Rank = 1, Score(Joint Probabil-
ity) = 1, DocID: FBIS3-10202
Bayesian Marginal Probs: belize: 0.889529, lo-
cate: 1.000000 ....confirmed that the Belizean
Government will assume responsibility for its own
defense as of 1 January 1994 and announced that
it had started the “ immediate withdrawal of the
UK troops stationed in that country located in the
central american isthmus . Lourdes ......
Figure 16: Answer to Q2, Rank = 2, Score(Joint Probability) =
0.890192, DocID: FBIS3-50428
Bayesian Marginal Probs: belize: 1.000000, lo-
cate: 0.033451 ....prepared to begin negotiations
on the territorial dispute with Guatemala ; : adding
that a commission has been created for this pur-
pose and only the final details must be settled . The
Guatemalan Government has recognized Belize ’s
independence ; : therefore , we have accepted the
fact that a .....
Figure 17: Non-answer for Q2, Rank = 10, Score(Joint Proba-
bility) = 0.452310, DocID: FBIS4-56830
WordNet.
6 Discussion and future work
We have described a passage-scoring algorithm for
QA via Bayesian inference on lexical relations. By
separating the inference algorithm from the design
of the knowledge base, we made our system exten-
sible and trainable from a corpus. The accuracy of
our system is better than IR-like scoring techniques,
and compares favorably with well-known QA sys-
tems, as shown in section 5.
Our work hinges upon the existence of lexical re-
lations in the WordNet. We would like to point out
here that no special efforts were made in the con-
struction of the Bayesian Network from WordNet
nor did we attempt to fill in the desirable ‘missing
links’ between words or synsets in WordNet or re-
TREC Question ID : 224 What does laser
stand for ?
Figure 18: Sample question Q3
Bayesian marginal Probs: laser: 1.000000,
stand for: 1.000000 ....Yu.A. Rezunkov , can-
didate of technical sciences , department head ,
V.S. Sirazetdinov , candidate of technical sciences
, manager of test stand for adaptive laser systems ,
A.V . Charukhchev ,.....
Figure 19: Non-Answer to Q3, Rank = 1, Score(Joint Probabil-
ity) = 1, DocID: FBIS4-47304
Bayesian marginal Probs: laser: 1.000000,
stand for: 0.073516 ...Laser stands for light
amplification by stimulated emission of radia-
tion. Both masers and lasers are devices contain-
ing crystal , gas or other substances that get atoms
so excited as they bounce back and forth in step
between two mirrors that they finally burst out in
one coherent .....
Figure 20: Answer to Q3, Rank = 25, Score(Joint Probability)
= 0.890192, DocID: FBIS3-50428
Bayesian marginal Probs: laser: 1.000000,
stand for: 0.060797
Corpus based-marginal Probs: laser: 0.990561,
stand for: 2.886e-05 ....surface plasma by inter-
action of laser radiation and solid targets covering
the 10^5 to 10^10 range
of radiation intensity being essentially considered
here along with negative and positive.....
Figure 21: Non-answer for Q3, Rank = 50, Score(Joint Proba-
bility) = 0.86329, DocID: FBIS4-22835
move spurious links in WordNet. Thus, we are able
to find probabilities based on semantic relations to
the extent given by links in WordNet, and we are able
to keep words uncorrelated with each other to the extent
they are disconnected in WordNet. To some extent,
we attempt to learn the Bayesian Network parame-
ters and this does result in improvement in Question
Answering performance. But it will be interesting to
see if training the network with bigger corpora im-
proves the performance further. Another experiment
that remains to be tried is training the Bayesian Net-
work with samples of successful and unsuccessful
(question, answer) pairs.
One thing to note is that if all the question words
are contained in the passage, the passage will get a
high rank because it will induce a joint probability
score of 1 on the question. This can happen even if
the answer is not contained in the passage.
Another limitation is the computational and mem-
ory cost. On average, Bayesian inferencing on a passage
took 0.03 seconds, and the memory re-
quirement goes up to 30 MB. Future work will
include reducing the online memory and com-
putational requirements by simplifying the network
structure and/or moving certain computations of-
fline.
We would also like to find better initial values to
speed up learning and avoid local optima. We would
like to re-introduce the notion of lexical proximity
into our inference process, so as to further improve
the accuracy of WSD. We also wish to explore how
continual feedback and retraining of the BBN can
improve the accuracy of our system.

References
Naoki Abe and Hang Li. 1996. Learning word association
norms using tree cut pair models. In Proceedings of the 13th
International Conference on Machine Learning.
C. Buckley. 1985. Implementation of the smart information
retrieval system. Technical report, Technical Report TR85-
686, Department of Computer Science, Cornell University.
C. L. A. Clarke, Gordon V. Cormack, and Thomas R. Lynam.
2001. Exploiting redundancy in question answering. In
Proceedings of the 24th annual international ACM SIGIR
conference on Research and development in information re-
trieval, pages 358–365. ACM Press.
C. Fellbaum. 1998a. WordNet: An Electronic Lexical Database,
chapter Using WordNet for Text Retrieval, pages 285–303.
The MIT Press, Cambridge, MA.
Christiane Fellbaum. 1998b. WordNet: An Electronic Lexical
Database. The MIT Press.
Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihal-
cea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile
Rus, and Paul Morarescu. 2000. Falcon: Boosting knowl-
edge for answer engines. In Proceedings of the ninth text
retrieval conference (TREC-9), November.
David Heckerman. 1995. A Tutorial on Learning Bayesian
Networks. Technical Report MSR-TR-95-06, March.
Boris Katz. 1997. From sentence processing to information
access on the world wide web. AAAI Spring Symposium on
Natural Language Processing for the World Wide Web, Stan-
ford University, Stanford CA.
Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. 2001.
Scaling question answering to the web. In Proceedings of
the Tenth International World Wide Web Conference, pages
150–161.
David D. Lewis and Karen Sparck Jones. 1996. Natural lan-
guage processing for information retrieval. Communications
of the ACM, 39(1):92–101.
J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems:
Networks of Plausible Inference. Morgan Kaufmann Pub-
lishers, Inc.
Adwait Ratnaparkhi. 1996. A maximum entropy part-of-
speech tagger. In Proceedings of the Empirical Methods in
Natural Language Processing Conference, May 17-18, 1996.
University of Pennsylvania.
Mark Sanderson. 1994. Word sense disambiguation and in-
formation retrieval. In Proceedings of SIGIR-94, 17th ACM
International Conference on Research and Development in
Information Retrieval, pages 49–57, Dublin, IE.
Ellen Vorhees. 2000. Overview of the TREC-9 question answering
track. In Text REtrieval Conference 9 (TREC-9).
Janyce Wiebe, Tom O'Hara, and Rebecca Bruce. 1998. Con-
structing Bayesian networks from WordNet for word sense
disambiguation: representation and processing issues. In
Proc. COLING-ACL ’98 Workshop on the Usage of Word-
Net in Natural Language Processing Systems.
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum
likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39:1–38.
Ganesh Ramakrishnan and Pushpak Bhattacharyya. 2003. Text
Representation with WordNet Synsets: A Soft Sense Disam-
biguation Approach. To appear in Proceedings of the 
International Conference on Natural Language in Informa-
tion Systems, Springer Verlag.
