Reranking Answers for Definitional QA Using Language Modeling

Yi Chen
School of Software Engineering
Chongqing University
Chongqing, China, 400044
126cy@126.com

Ming Zhou
Microsoft Research Asia
5F Sigma Center, No.49 Zhichun Road, Haidian
Beijing, China, 100080
mingzhou@microsoft.com

Shilong Wang
College of Mechanical Engineering
Chongqing University
Chongqing, China, 400044
slwang@cqu.edu.cn
Abstract* 
Statistical ranking methods based on a centroid vector (profile) extracted from external knowledge have been widely adopted in the top definitional QA systems in TREC 2003 and 2004. In these approaches, the terms in the centroid vector are treated as a bag of words under an independence assumption. To relax this assumption, this paper proposes a novel language model-based answer reranking method that improves on the existing bag-of-words approach by considering the dependence of the words in the centroid vector. Experiments have been conducted to evaluate different dependence models. The results on the TREC 2003 test set show that the reranking approach with a biterm language model significantly outperforms the approaches with the bag-of-words model and the unigram language model, by 14.9% and 12.5% respectively in F-Measure(5).
1 Introduction 
In recent years, QA systems in TREC (the Text REtrieval Conference) have made remarkable progress (Voorhees, 2002). Before 2003, the TREC QA task mainly focused on factoid questions, in which the answer to a question is a number, a person name, an organization name, or the like.
Questions like "Who is Colin Powell?" or "What is mold?" are definitional questions (Voorhees, 2003). Statistics from 2,516 Frequently Asked Questions (FAQ) extracted from the Internet FAQ Archives1 show that around 23.6% are definitional questions. This indicates that definitional questions occur frequently and are an important question type. TREC started the evaluation of definitional QA in 2003. The definitional QA systems in TREC are required to extract definitional nuggets/sentences that contain highly descriptive information about the question target from a given large corpus.

* This work was finished while the first author was visiting Microsoft Research Asia during March 2005-March 2006, as part of the AskBill Chatbot project led by Dr. Ming Zhou.
For definitional questions, statistical ranking methods based on a centroid vector (profile) extracted from external resources, such as an online encyclopedia, are widely adopted in the top systems in TREC 2003 and 2004 (Xu et al., 2003; Blair-Goldensohn et al., 2003; Wu et al., 2004). In these systems, for a given question, a vector consisting of the terms that most frequently co-occur with the question target is formed as the question profile. Candidate answers extracted from a given large corpus are then ranked based on their similarity to the question profile. The similarity is normally the TFIDF score, in which both the candidate answer and the question profile are treated as a bag of words in the framework of the Vector Space Model (VSM).
VSM is based on an independence assumption, which assumes that the terms in a vector are statistically independent of one another. Although this assumption makes the development of retrieval models easier and the retrieval operation tractable, it does not hold in textual data. For example, for the question "Who is Bill Gates?", the words "born" and "1955" in a candidate answer are not independent.
In this paper, we are interested in exploiting term dependence to improve answer reranking for definitional QA. Specifically, a language model is utilized to capture the term dependence. A language model is a probability distribution that captures the statistical regularities of natural language use. In a language model, the key elements are the probabilities of word sequences, denoted as P(w1, w2, ..., wn), or P(w1,n) for short. Recently, language models have been successfully used for information retrieval (IR) (Ponte and Croft, 1998; Song and Croft, 1999; Lafferty and Zhai, 2001; Gao et al., 2004; Cao et al., 2005). It is therefore natural to apply language models to rank candidate answers, just as they have been applied to rank search results in the IR task.

1 http://www.faqs.org/faqs/
The basic idea of our research is as follows: given a definitional question q, an ordered centroid OC is learned from the web and a language model LM(OC) is trained on it; candidate answers can then be ranked by the probabilities estimated by LM(OC). A series of experiments on the standard TREC 2003 collection have been conducted to evaluate bigram and biterm language models. The results show that both language models produce promising improvements by capturing the term dependence, and that the biterm model achieves the best performance. The biterm language model, interpolated with the unigram model, significantly improves over the VSM and the unigram model by 14.9% and 12.5% respectively in F-Measure(5).
In the rest of this paper, Section 2 reviews related work. Section 3 presents details of the proposed method. Section 4 introduces the structure of our experimental system. We show the experimental results in Section 5, and conclude the paper in Section 6.
2 Related Work 
Web information has been widely used for answer reranking and validation. For the factoid QA task, AskMSR (Brill et al., 2001) ranks the answers by counting the occurrences of candidate answers returned from a search engine. Similarly, DIOGENE (Magnini et al., 2002) applies search engines to validate candidate answers.
For the definitional QA task, Lin (2002) presented an approach in which web-based answer reranking is combined with dictionary-based (e.g., WordNet) reranking, leading to a 25% increase in mean reciprocal rank (MRR). Xu et al. (2003) proposed a statistical ranking method based on a centroid vector (i.e., a vector of words and frequencies) learned from an online encyclopedia (i.e., Wikipedia2) and the web. Candidate answers were reranked based on their similarity (TFIDF score) to the centroid vector. Similar techniques were explored in (Blair-Goldensohn et al., 2003). In this paper, we explore the dependence among the terms in the centroid vector to improve answer reranking for definitional QA.

2 http://www.wikipedia.org
In recent years, language modeling has been widely employed in IR (Ponte and Croft, 1998; Song and Croft, 1999; Miller et al., 1999; Lafferty and Zhai, 2001). The basic idea is to compute the conditional probability P(Q|D), i.e., the probability of generating a query Q given the observation of a document D. The retrieved documents are ranked in descending order of this probability.
Song and Croft (1999) proposed a general language model that incorporates word dependence by using bigrams. Srikanth and Srihari (2002) introduced biterm language models, which are similar to the bigram model except that the constraint of word order is relaxed, and observed improved performance. Gao et al. (2004) presented a new method of capturing word dependencies, in which they extended state-of-the-art language modeling approaches to information retrieval by introducing a dependence structure learned from training data. Cao et al. (2005) proposed a novel dependence model that incorporates both WordNet relationships and co-occurrence into the language modeling framework for IR. In our approach, we propose bigram and biterm models to capture the term dependence in the centroid vector.
Applying language modeling to the QA task has not been widely researched. Zhang and Lee (2003) proposed a method using language models for passage retrieval for factoid QA. They trained two language models, one a question-topic language model and the other a passage language model, and utilized the divergence between the two to rank passages. In this paper, we focus on reranking answers for definitional questions.
Among other ranking approaches, Xu et al. (2005) formalized ranking definitions as a classification problem, and Cui et al. (2004) proposed soft patterns to rank answers for definitional QA.
3 Reranking Answers Using Language Model
3.1 Model background 
In practice, a language model is often approximated by N-gram models:
Unigram:
$P(w_{1,n}) = P(w_1)P(w_2)\cdots P(w_n)$  (1)
Bigram:
$P(w_{1,n}) = P(w_1)P(w_2|w_1)\cdots P(w_n|w_{n-1})$  (2)
The unigram model makes the strong assumption that each word occurs independently. The bigram model takes the local context into consideration. It has been shown to work better than the unigram language model in IR (e.g., Song and Croft, 1999).
Biterm language models are similar to bigram language models except that the constraint of word order is relaxed. Therefore, a document containing "information retrieval" and a document containing "retrieval (of) information" will be assigned the same generation probability. The biterm probabilities can be approximated using the frequency of occurrence of terms. Three approximation methods were proposed in Srikanth and Srihari (2002). The so-called min-Adhoc approximation truly relaxes the constraint of word order and outperformed the other two approximation methods in their experiments:
$P_{BT}(w_i|w_{i-1}) \approx \frac{C(w_{i-1}, w_i) + C(w_i, w_{i-1})}{\min\{C(w_{i-1}), C(w_i)\}}$  (3)

Equation (3) is the min-Adhoc approximation, where C(X) gives the number of occurrences of the string X.
3.2 Reranking based on language model  
In our approach, we adopt bigram and biterm language models. As a smoothing approach, linear interpolation of unigrams and bigrams is employed. Given a candidate answer A = t1t2...ti...tn and a bigram or biterm back-off language model OC trained with the ordered centroid, the probability of generating A can be estimated by Equation (4):
$P(A|OC) = P(t_1, \ldots, t_n|OC) = P(t_1|OC)\prod_{i=2}^{n}\left[\lambda P(t_i|OC) + (1-\lambda)P(t_i|t_{i-1}, OC)\right]$  (4)
where OC stands for the language model of the ordered centroid and λ is the mixture weight combining the unigram and bigram (or biterm) probabilities. Taking the logarithm and then exponentiating Equation (4), we get Equation (5):
$Score(A) = \exp\left\{\log P(t_1|OC) + \sum_{i=2}^{n}\log\left[\lambda P(t_i|OC) + (1-\lambda)P(t_i|t_{i-1}, OC)\right]\right\}$  (5)
We observe that this formula penalizes verbose candidate answers. This can be alleviated by adding a brevity penalty, BP, inspired by machine translation evaluation (Papineni et al., 2001):
$BP = \exp\left(\min\left\{1 - \frac{L_{ref}}{L_A},\ 1\right\}\right)$  (6)
where Lref is a constant standing for the length of the reference answer (i.e., the centroid vector) and LA is the length of the candidate answer. Combining Equations (5) and (6), we get the final scoring function:
$FinalScore(A) = BP \times Score(A) = \exp\left(\min\left\{1 - \frac{L_{ref}}{L_A},\ 1\right\}\right) \times \exp\left\{\log P(t_1|OC) + \sum_{i=2}^{n}\log\left[\lambda P(t_i|OC) + (1-\lambda)P(t_i|t_{i-1}, OC)\right]\right\}$  (7)
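To make the scoring concrete, here is a minimal Python sketch of Equations (5)-(7). It is an illustration under our own assumptions: the probability functions p_uni and p_bi correspond to the estimates described in Section 3.3, the function names are ours, and the probability floor is a guard the paper does not specify.

```python
import math

def final_score(tokens, p_uni, p_bi, lam, l_ref, floor=1e-12):
    """Score a non-empty candidate answer A = t_1...t_n, Equations (5)-(7).

    tokens -- candidate answer as a list of terms
    p_uni  -- function t -> P(t|OC), the unigram probability
    p_bi   -- function (t, prev) -> P(t|prev, OC), bigram or biterm
    lam    -- interpolation weight lambda
    l_ref  -- length of the reference answer (the centroid vector)
    floor  -- guard against log(0); this detail is not in the paper
    """
    # Equation (5): log P(t_1|OC) plus the interpolated log-probabilities.
    log_score = math.log(max(p_uni(tokens[0]), floor))
    for prev, cur in zip(tokens, tokens[1:]):
        p = lam * p_uni(cur) + (1 - lam) * p_bi(cur, prev)
        log_score += math.log(max(p, floor))
    # Equation (6): brevity penalty, kept in log space.
    log_bp = min(1 - l_ref / len(tokens), 1)
    # Equation (7): FinalScore(A) = BP * Score(A); combining the two in
    # log space avoids numerical underflow for long candidates.
    return math.exp(log_bp + log_score)
```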
 
3.3 Parameter estimation 
In Equation (7), we need to estimate three parameters: P(ti|OC), P(ti|ti-1, OC), and λ. For P(ti|OC) and P(ti|ti-1, OC), maximum likelihood estimation (MLE) is employed:
$P(t_i|OC) = \frac{Count_{OC}(t_i)}{N_{OC}}$  (8)

$P(t_i|t_{i-1}, OC) = \frac{Count_{OC}(t_{i-1}, t_i)}{Count_{OC}(t_{i-1})}$  (9)
where CountOC(X) is the number of occurrences of the string X in the ordered centroid and NOC stands for the total number of tokens in the ordered centroid. For the biterm language model, we use the above-mentioned min-Adhoc approximation (Srikanth and Srihari, 2002):
$P_{BT}(t_i|t_{i-1}, OC) = \frac{Count_{OC}(t_{i-1}, t_i) + Count_{OC}(t_i, t_{i-1})}{\min\{Count_{OC}(t_{i-1}), Count_{OC}(t_i)\}}$  (10)
For the unigram model, we do not need smoothing because we are only concerned with terms in the centroid vector. Recall that the bigram and biterm probabilities have already been smoothed by interpolation.
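The following is a minimal sketch of these estimates, assuming the ordered centroid is given as one token list per retained sentence; the function names are ours.

```python
from collections import Counter

def estimate_lm(ordered_centroid):
    """MLE estimates of Equations (8)-(10) from the ordered centroid.

    ordered_centroid -- one token list per retained sentence in W
    Returns the probability functions p_uni, p_bigram and p_biterm.
    """
    uni, bi = Counter(), Counter()
    for sent in ordered_centroid:
        uni.update(sent)                      # Count_OC(t_i)
        bi.update(zip(sent, sent[1:]))        # Count_OC(t_{i-1}, t_i)
    n_oc = sum(uni.values())                  # N_OC, total tokens

    def p_uni(t):                             # Equation (8)
        return uni[t] / n_oc

    def p_bigram(t, prev):                    # Equation (9)
        return bi[(prev, t)] / uni[prev] if uni[prev] else 0.0

    def p_biterm(t, prev):                    # Equation (10), min-Adhoc
        denom = min(uni[prev], uni[t])
        return (bi[(prev, t)] + bi[(t, prev)]) / denom if denom else 0.0

    return p_uni, p_bigram, p_biterm
```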
The λ can be learned from a training corpus using an Expectation Maximization (EM) algorithm. Specifically, we estimate λ by maximizing the likelihood of all training instances, given the bigram or biterm model:
$\lambda^{*} = \arg\max_{\lambda}\sum_{j=1}^{|INS|}\log P(t_1^{(j)} \ldots t_{l_j}^{(j)}|OC) = \arg\max_{\lambda}\sum_{j=1}^{|INS|}\sum_{i=2}^{l_j}\log\left[\lambda P(t_i^{(j)}) + (1-\lambda)P(t_i^{(j)}|t_{i-1}^{(j)})\right]$  (11)
BP and P(t1) are ignored because they do not affect λ. λ can be estimated using the following EM iterative procedure:
1) Initialize λ to a random estimate between 0 and 1, e.g., 0.5;
2) Update λ using:
$\lambda^{(r+1)} = \frac{1}{|INS|}\sum_{j=1}^{|INS|}\frac{1}{l_j - 1}\sum_{i=2}^{l_j}\frac{\lambda^{(r)}P(t_i^{(j)})}{\lambda^{(r)}P(t_i^{(j)}) + (1-\lambda^{(r)})P(t_i^{(j)}|t_{i-1}^{(j)})}$  (12)
where INS denotes all training instances, |INS| gives the number of training instances and is used as a normalization factor, and lj gives the number of tokens in the jth instance of the training data;
3) Repeat Step 2 until λ converges.
We use the TREC 2004 test set3 as our training data; according to the experimental results, we set λ to 0.4 for the bigram model and 0.6 for the biterm model.
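As an illustration, here is a compact sketch of this EM procedure; the convergence tolerance and the function interface are our assumptions.

```python
def estimate_lambda(instances, p_uni, p_bi, lam=0.5, tol=1e-4):
    """EM procedure of Equations (11)-(12) for the mixture weight lambda.

    instances -- training answers, each a token list of length >= 2
    p_uni, p_bi -- probability functions estimated as in Section 3.3
    lam -- initial estimate (Step 1); tol -- convergence threshold (Step 3)
    """
    while True:
        total = 0.0
        for tokens in instances:
            # Inner sum of Equation (12): expected share of the unigram
            # component at each position i = 2..l_j.
            post = 0.0
            for prev, cur in zip(tokens, tokens[1:]):
                num = lam * p_uni(cur)
                den = num + (1 - lam) * p_bi(cur, prev)
                post += num / den if den > 0 else 0.0
            total += post / (len(tokens) - 1)     # 1 / (l_j - 1)
        new_lam = total / len(instances)          # 1 / |INS|
        if abs(new_lam - lam) < tol:              # Step 3: until convergence
            return new_lam
        lam = new_lam                             # Step 2: update lambda
```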
4 System Architecture 
[Figure 1: pipeline diagram. Stage 1 (training the language model): for a target (e.g., Aaron Copland), an ordered centroid list (e.g., "born Nov 14 1900") is learned from the web and a language model is trained on it. Stage 2 (reranking using the LM): candidate answers extracted from the AQUAINT corpus are reranked. Stage 3 (removing redundancies): redundant answers are removed, yielding answers such as "American composer".]
Figure 1. System architecture.
We propose a three-stage approach for answer extraction: 1) learning a language model from the web; 2) adopting the language model to rerank candidate answers; and 3) removing redundancies. Figure 1 shows the five main modules.
Learning ordered centroid:
1) Query expansion. Definitional questions are normally short (e.g., "Who is Bill Gates?"). Query expansion is used to refine the query intention. First, we reformulate the query by simply adding clue words to the question: for a "Who is ...?" question, we add the word "biography"; for a "What is ...?" question, we add the phrases "is usually", "refers to", etc. We learn these clue words using a method similar to that proposed in (Ravichandran and Hovy, 2002). Second, we query a web search engine (i.e., Google4) with the reformulated query and learn the top-R (we empirically set R=5) terms most frequently co-occurring with the target from the returned snippets as query expansion terms;
2) Learning centroid vector (profile). We query Google again with the target and the expansion terms learned in the previous step, download the top-N (we empirically set N=500, based on a tradeoff between the number of snippets and the time complexity) snippets, and split these snippets into sentences. Then, we retain the sentences that contain the target, denoted as W. Finally, we learn the top-M (we empirically set M=350) most frequent co-occurring terms (stemmed) from W, scored with Equation (13) (Cui et al., 2004), as the centroid vector; a code sketch of this weighting is given after this list:

$Weight(t) = \frac{\log(Co(t, T) + 1)}{\log(Count(t) + 1) + \log(Count(T) + 1)} \times idf(t)$  (13)

where Co(t, T) denotes the number of sentences in which t co-occurs with the target T, and Count(t) gives the number of sentences containing the word t. We also use the inverse document frequency of t, idf(t)5, as a measurement of the global importance of the word;

3 The test data for TREC-13 (2004) includes 65 definition questions; NIST dropped one in the official evaluation.
4 http://www.google.com
3) Extracting ordered centroid. For each sentence in W, we retain the terms in the centroid vector as the ordered centroid list. Words not contained in the centroid vector are treated as "stop words" and ignored. E.g., for "Who is Aaron Copland?", the ordered centroid list is shown below (the words in italics are extracted and put into the ordered centroid list):
1. Today's Highlight in History: On November 14, 1900, Aaron Copland, one of America's leading 20th century composers, was born in New York City. ⇒ November 14 1900 Aaron Copland America composer born New York City
2. ...
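As referenced in step 2 above, here is a one-function sketch of the Equation (13) weighting; the argument names are ours, and idf(t) is assumed to be precomputed (the paper approximates it from BNC statistics).

```python
import math

def centroid_weight(co_tT, count_t, count_T, idf_t):
    """Term weight of Equation (13), after Cui et al. (2004).

    co_tT   -- Co(t, T): # sentences in which t co-occurs with the target T
    count_t -- Count(t): # sentences containing t
    count_T -- Count(T): # sentences containing the target T
    idf_t   -- idf(t), precomputed (approximated from BNC statistics)
    """
    return (math.log(co_tT + 1)
            / (math.log(count_t + 1) + math.log(count_T + 1))) * idf_t
```

The top-M terms by this weight form the centroid vector.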
Extracting candidate answers: We extract candidates from the AQUAINT corpus.
1) Querying the AQUAINT corpus with the target and retrieving relevant documents;
2) Splitting the documents into sentences and extracting the sentences containing the target. Here, in order to improve recall, simple heuristic rules are used to handle the problem of coreference resolution: if a sentence is deemed to contain the target and its next sentence starts with "he", "she", "it", or "they", then the next sentence is also retained.
Training language models: As mentioned above, we train a language model with the obtained ordered centroid for each question.
Answer reranking: Once the language models and the candidate answers are ready for a given question, the candidate answers are reranked based on the probabilities of the language models generating them.
Removing redundancies: Repetitive and similar candidate sentences are removed. Given a reranked candidate answer set CA, redundancy removal is conducted as follows:

5 We use the statistics from the British National Corpus (BNC) site to approximate words' IDF: http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html
Step 1: Initialize the result set A = {}; take the top (j = 1) element from CA and add it to A; set j = 2.
Step 2: Take the jth element from CA, denoted as CAj. Compute the cosine similarity between CAj and each element i of A, denoted sij. Let skj = max{s1j, s2j, ..., s|A|j}; if skj < threshold (we set it to 0.75), then add CAj to A.
Step 3: If the length of A exceeds a predefined threshold, exit; otherwise set j = j + 1 and go to Step 2.
Figure 2. Algorithm for removing redundancy.
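A compact sketch of the Figure 2 procedure follows, using a bag-of-words cosine similarity; the tokenized-sentence representation and the default length cap of 12 (the people-target threshold from Section 5.1.3) are our assumptions for illustration.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Bag-of-words cosine similarity between two tokenized sentences."""
    va, vb = Counter(a), Counter(b)
    dot = sum(f * vb[t] for t, f in va.items())
    norm = math.sqrt(sum(f * f for f in va.values())) * \
           math.sqrt(sum(f * f for f in vb.values()))
    return dot / norm if norm else 0.0

def remove_redundancy(candidates, threshold=0.75, max_len=12):
    """Figure 2: scan the reranked candidates (best first) and keep one
    only if its similarity to every kept answer is below the threshold."""
    kept = []
    for cand in candidates:
        if all(cosine_sim(cand, k) < threshold for k in kept):
            kept.append(cand)
        if len(kept) >= max_len:   # the predefined answer-length threshold
            break
    return kept
```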
5 Experiment & Evaluation 
To obtain a comparable evaluation, we apply our approach to the TREC 2003 definitional QA task. More details are given in the following sections.
5.1 Experiment setup 
5.1.1 Dataset 
We employ the dataset from the TREC 2003 QA task. It includes the AQUAINT corpus of more than 1 million news articles from the New York Times (1998-2000), Associated Press (1998-2000) and Xinhua News Agency (1996-2000), and 50 definitional question/answer pairs. Among these 50 definitional questions, 30 are about people (e.g., Aaron Copland), 10 are about organizations (e.g., Friends of the Earth) and 10 are about other entities (e.g., quasars). We employ Lemur6 to retrieve relevant documents from the AQUAINT corpus. For each query, we return the top 500 documents.
5.1.2 Evaluation metrics 
We adopt the evaluation metrics used in the TREC definitional QA task (Voorhees, 2003 and 2004). TREC provides a list of essential and acceptable nuggets for answering each question. We use these nuggets to assess our approach: two human assessors examine how many essential and acceptable nuggets are covered in the returned answers. Every question is scored using nugget recall (NR) and an approximation to nugget precision (NP) based on answer length. The final score for a definition response is computed using the F-Measure. In TREC 2003, the β parameter was set to 5, indicating that recall is 5 times as important as precision (Voorhees, 2003):
                                                 
6 A free IR tool, http://www.lemurproject.org/ 
$F(\beta=5) = \frac{(5^2+1) \times NP \times NR}{5^2 \times NP + NR}$  (14)

in which

$NR = \frac{\#\,\text{essential nuggets returned in answer}}{\#\,\text{essential nuggets}}$  (15)

$NP = \begin{cases} 1 & \text{if } length < allowance \\ 1 - \frac{length - allowance}{length} & \text{otherwise} \end{cases}$  (16)

where allowance = 100 × (# essential + # acceptable nuggets returned) and length = # non-whitespace characters in the strings returned.
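A small sketch of this scoring follows, with the nugget counts assumed to come from the human assessors; the function name and argument names are ours.

```python
def f_measure(essential_returned, essential_total, nuggets_returned, length,
              beta=5):
    """Nugget F-measure of Equations (14)-(16).

    essential_returned -- # essential nuggets covered by the response
    essential_total    -- # essential nuggets on the TREC list
    nuggets_returned   -- # essential + acceptable nuggets returned
    length             -- # non-whitespace characters in the response
    """
    nr = essential_returned / essential_total              # Equation (15)
    allowance = 100 * nuggets_returned
    if length < allowance:                                 # Equation (16)
        np = 1.0
    else:
        np = 1 - (length - allowance) / length
    denom = beta ** 2 * np + nr
    return (beta ** 2 + 1) * np * nr / denom if denom else 0.0  # Eq. (14)
```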
5.1.3 Baseline system 
We employ the TFIDF heuristic algorithm-based approach as our baseline system, in which the candidate answers and the centroid are treated as bags of words:

$weight_i = TF_i \times IDF_i = TF_i \times \ln\frac{N}{DF_i}$  (17)

where TFi gives the number of occurrences of term i, DFi7 is the number of documents containing term i, and N gives the total number of documents.
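For reference, a minimal sketch of the baseline's Equation (17) weighting; candidates are then ranked by the cosine similarity between these vectors and the centroid vector (as in the redundancy-removal sketch above). The names are ours, and the document frequencies are assumed precomputed.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Bag-of-words TFIDF vector, Equation (17): TF_i * ln(N / DF_i).

    df -- term -> document frequency (the paper approximates DF from BNC)
    """
    tf = Counter(tokens)
    return {t: f * math.log(n_docs / df[t])
            for t, f in tf.items() if df.get(t)}
```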
For comparison purposes, the unigram model is also adopted; its scoring function is similar to Equation (7), the main difference being that the unigram-based scoring function involves only the unigram probability P(ti|OC).
For all systems, we empirically set the threshold of answer length to 12 sentences for people targets (e.g., Aaron Copland) and 10 sentences for other targets (e.g., quasars).
5.2 Performance evaluation 
In the first evaluation, we assess the performance obtained by our language model method against the baseline system, without query expansion (QE). The evaluation results are shown in Table 1.
                  Average NR       Average NP       F(5)
Baseline (TFIDF)  0.469            0.221            0.432
Unigram           0.508 (+8.3%)    0.204 (-7.7%)    0.459 (+6.3%)
Bigram            0.554 (+18.1%)   0.234 (+5.9%)    0.505 (+16.9%)
Biterm            0.567 (+20.9%)   0.222 (+0.5%)    0.511 (+18.3%)
Table 1. Comparisons without QE.
From Table 1, it is easy to observe that the unigram, bigram and biterm-based approaches improve the F(5) score by 6.3%, 16.9% and 18.3% over the baseline system, respectively. At the same time, the bigram and biterm models improve the F(5) score by 10.0% and 11.3% over the unigram model, respectively. The unigram model slightly outperforms the baseline. We also notice that the biterm model improves slightly over the bigram model, since it ignores the order of term occurrence. This observation coincides with the experimental results of Srikanth and Srihari (2002). These results show that the bigram and biterm models outperform the VSM and unigram models dramatically, a clear indication that a language model that takes into account the term dependence within the centroid vector is an effective way to rerank answers.

7 We also use the British National Corpus (BNC) to estimate it.
As mentioned above, QE is involved in our system. In the second evaluation, we assess the performance obtained by the language model method against the baseline system, with QE. We list the evaluation results in Table 2.
               Average NR       Average NP       F(5)
Baseline (QE)  0.508            0.207            0.462
Unigram (QE)   0.518 (+2.0%)    0.223 (+7.7%)    0.472 (+2.2%)
Bigram (QE)    0.573 (+12.8%)   0.228 (+10.1%)   0.518 (+12.1%)
Biterm (QE)    0.582 (+14.6%)   0.240 (+15.9%)   0.531 (+14.9%)
Table 2. Comparisons with QE.
From Table 2, we observe that, with QE, the bigram and biterm models still outperform the baseline system (VSM) significantly, by 12.1% (p8=0.03) and 14.9% (p=0.004) in F(5). Furthermore, the bigram and biterm models perform significantly better than the unigram model, by 9.7% (p=0.07) and 12.5% (p=0.02) in F(5), respectively. This indicates that term dependence is effective in further improving the performance. It is easy to observe that the baseline is close to the unigram model, since both systems are based on the independence assumption. We also notice that the biterm model improves slightly over the bigram model. At the same time, all four systems improve their performance over the corresponding system without QE. The main reason is that the quality of the centroid vector is enhanced by QE. We are also interested in the performance comparison with and without QE for each system. Through this comparison, we find that the baseline system relies on QE more heavily than our approach does: with QE, the baseline system improves its performance by 6.9%, while the language model approaches improve their performance by only 2.8%, 2.6% and 3.9%, respectively.
                                                 
8 A t-test has been performed.
The F(5) performance comparison between the baseline model and the biterm model for each of the 50 TREC questions is shown in Figure 3. QE is used in both the baseline system and the biterm system.
[Figure 3 plot: F(5) score for each question ID (1-50), comparing the Baseline and Our Biterm LM, both with QE.]
Figure 3. Biterm vs. Baseline. 
We are also interested in a comparison with the systems in TREC 2003. The best F(5) score returned by our proposed approach is 0.531, which is close to that of the top run in TREC 2003 (Voorhees, 2003). The F(5) score of the best system, reported by BBN (Xu et al., 2003), is 0.555. In BBN's experiments, the centroid vector was learned from human-made external knowledge resources, such as an encyclopedia and the web. Table 3 compares our biterm model-based system with BBN's run for different β values.
Run Tag   F(β=1)   F(β=2)   F(β=3)   F(β=4)   F(β=5)
BBN       0.310    0.423    0.493    0.532    0.555
Ours      0.288    0.382    0.470    0.509    0.531
Table 3. Comparison with BBN's run.
5.3 Case study 
A positive example returned by our proposed approach is given below. For Qid 2304, "Who is Niels Bohr?", the reference answers are given in Table 4 (only vital nuggets are listed):
vital   Danish
vital   Nuclear physicist
vital   Helped create atom bomb
vital   Nobel Prize winner
Table 4. Reference answers for the question "Who is Niels Bohr?".
Answers returned by the baseline system and 
our proposed system are presented in Table 5. 
System            Returned answers (partly)
Baseline system   1. ..., Niels Bohr, the great Danish scientist
                  2. ...the German physicist Werner Heisenberg and the Danish physicist Niels Bohr
                  3. ...took place between the Danish physicist Niels Bohr and his onetime protege, the German scientist ...
                  4. ... two great physicists, the Dane Niels Bohr and Werner Heisenberg ...
                  5. ...
Proposed system   1. ...physicist Werner Heisenberg travel to ... his colleague and old mentor, Niels Bohr, the great Danish scientist
                  2. ... two great physicists, the Dane Niels Bohr and Werner Heisenberg ...
                  3. Today's Birthdays: ... Danish nuclear physicist and Nobel Prize winner Niels Bohr (1885-1962)
                  4. the Danish atomic physicist, and his German pupil, Werner Heisenberg, the author of the uncertainty principle
                  5. ...
Table 5. Baseline vs. our system for the question "Who is Niels Bohr?".
From Table 5, it can be seen that the baseline system returned only one vital nugget: Danish (here we do not consider "physicist" to be semantically equal to "nuclear physicist"). Our proposed system returned three vital nuggets: Danish, nuclear physicist, and Nobel Prize winner. The answer sentence "Today's Birthdays: ... Danish nuclear physicist and Nobel Prize winner Niels Bohr (1885-1962)" contains more descriptive information for the question target "Niels Bohr" and is ranked 3rd among the top 12 answers in our proposed system.
5.4 Error analysis 
Although we have shown that the language model-based approach significantly improves system performance, there is still plenty of room for improvement.
1) Sparseness of search results degraded the learning of the ordered centroid. E.g., for Qid 2348, "What is the medical condition shingles?", we treat the words "medical condition shingles" as the question target, and we found that few sentences contain this target. Utilizing multiple search engines, such as MSN9 and AltaVista10, might alleviate this problem. Besides, more effective smoothing techniques could be promising.
2) Term ambiguity: for some queries, irrelevant documents are returned. E.g., for Qid 2267, "Who is Alexander Pope?", all the documents returned by the IR tool Lemur for this question are about "Pope John Paul II", not "Alexander Pope". This may be caused by the ambiguity of the word "Pope". In this case, term disambiguation, or adding constraint terms learned from the web to the query against the AQUAINT corpus, might be helpful.

9 http://www.msn.com
10 http://www.altavista.com
6 Conclusions and Future Work 
In this paper, we presented a novel answer reranking method for definitional questions. We use bigram and biterm language models to capture term dependence. Our contributions can be summarized as follows:
1) Word dependence is explored in an ordered centroid learned from the snippets of a web search engine;
2) Bigram and biterm models are presented to capture the term dependence and rerank candidate answers for definitional QA;
3) Evaluation results show that both the bigram and biterm models outperform the VSM and unigram models significantly on the TREC 2003 test set.
In our experiments, the centroid words were learned from the returned snippets of a web search engine. In the future, we are interested in enhancing the centroid learning using human knowledge sources such as encyclopedias. In addition, we will explore new smoothing techniques to enhance the interpolation method in our current approach.
7 Acknowledgements 
The authors are grateful to Dr. Cheng Niu and Yunbo Cao for their valuable suggestions on the draft of this paper. We are indebted to Shiqi Zhao, Shenghua Bao and Wei Yuan for their valuable discussions about this paper. We also thank Dwight for his assistance in polishing the English. Thanks also go to the anonymous reviewers whose comments have helped improve the final version of this paper.
References 
E. Brill, J. Lin, M. Banko, S. Dumais and A. Ng. 2001. Data-Intensive Question Answering. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001), Gaithersburg, MD, pp. 183-189.

S. Blair-Goldensohn, K.R. McKeown and A. Hazen Schlaikjer. 2003. A Hybrid Approach for QA Track Definitional Questions. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), pp. 336-343.

S. F. Chen and J. T. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, pp. 310-318.

Hang Cui, Min-Yen Kan and Tat-Seng Chua. 2004. Unsupervised Learning of Soft Patterns for Definitional Question Answering. In Proceedings of the Thirteenth World Wide Web Conference (WWW 2004), New York, pp. 90-99.

Guihong Cao, Jian-Yun Nie and Jing Bai. 2005. Integrating Word Relationships into Language Models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), Salvador, Brazil.

Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu and Guihong Cao. 2004. Dependence language model for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), Sheffield, UK.

Chin-Yew Lin. 2002. The Effectiveness of Dictionary and Web-Based Answer Reranking. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.

J. Lafferty and C. Zhai. 2001. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, pp. 111-119.

B. Magnini, M. Negri, R. Prevete and H. Tanev. 2002. Is It the Right Answer? Exploiting Web Redundancy for Answer Validation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), Philadelphia, PA.

D. Miller, T. Leek and R. Schwartz. 1999. A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference, pp. 214-221.

K. Papineni, S. Roukos, T. Ward and W.J. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), Thomas J. Watson Research Center.

J. Ponte and W.B. Croft. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275-281.

J. Prager, D. Radev and K. Czuba. 2001. Answering what-is questions by virtual annotation. In Proceedings of the Human Language Technology Conference (HLT 2001), San Diego, CA.

Deepak Ravichandran and Eduard Hovy. 2002. Learning Surface Text Patterns for a Question Answering System. In Proceedings of the 40th Annual Meeting of the ACL, pp. 41-47.

F. Song and W.B. Croft. 1999. A general language model for information retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 279-280.

M. Srikanth and R. Srihari. 2002. Biterm language models for document retrieval. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland.

Ellen M. Voorhees. 2002. Overview of the TREC 2002 question answering track. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002).

Ellen M. Voorhees. 2003. Overview of the TREC 2003 question answering track. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003).

Ellen M. Voorhees. 2004. Overview of the TREC 2004 question answering track. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004).

Lide Wu, Xuanjing Huang, Lan You, Zhushuo Zhang, Xin Li and Yaqian Zhou. 2004. FDUQA on TREC2004 QA Track. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004).

Jinxi Xu, Ana Licuanan and Ralph Weischedel. 2003. TREC2003 QA at BBN: Answering definitional questions. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003).

Jun Xu, Yunbo Cao, Hang Li and Min Zhao. 2005. Ranking Definitions with Supervised Learning Methods. In Proceedings of the 14th International World Wide Web Conference (WWW 2005), Industrial and Practical Experience Track, Chiba, Japan, pp. 811-819.

D. Zhang and W.S. Lee. 2003. A Language Modeling Approach to Passage Question Answering. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), NIST, Gaithersburg.

C. Zhai and J. Lafferty. 2001. A Study of Smoothing Methods for Language Models Applied to Information Retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), pp. 334-342.