Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in NLP, pages 65–72,
Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
Studying Feature Generation from Various Data Representations for 
Answer Extraction 
 
Dan Shen
†‡
 Geert-Jan M. Kruijff
†
 Dietrich Klakow
‡
 
† 
Department of Computational Linguistics 
Saarland University 
Building 17,Postfach 15 11 50 
66041 Saarbruecken, Germany 
‡ 
Lehrstuhl Sprach Signal Verarbeitung 
Saarland University 
Building 17, Postfach 15 11 50 
66041 Saarbruecken, Germany 
{dshen,gj}@coli.uni-sb.de 
{dietrich.klakow}@lsv.uni-saarland.de 
 
 
Abstract 
In this paper, we study how to generate 
features from various data representations, 
such as surface texts and parse trees, for 
answer extraction.  Besides the features 
generated from the surface texts, we 
mainly discuss the feature generation in 
the parse trees.  We propose and compare 
three methods, including feature vector, 
string kernel and tree kernel, to represent 
the syntactic features in Support Vector 
Machines.  The experiment on the TREC 
question answering task shows that the 
features generated from the more struc-
tured data representations significantly 
improve the performance based on the 
features generated from the surface texts.  
Furthermore, the contribution of the indi-
vidual feature will be discussed in detail. 
1 Introduction 
Open domain question answering (QA), as defined 
by the TREC competitions (Voorhees, 2003), 
represents an advanced application of natural lan-
guage processing (NLP).  It aims to find exact an-
swers to open-domain natural language questions 
in a large document collection.  For example: 
Q2131: Who is the mayor of San Francisco? 
Answer: Willie Brown 
A typical QA system usually consists of three 
basic modules: 1. Question Processing (QP) Mod-
ule, which finds some useful information from the 
questions, such as expected answer type and key 
words.  2. Information Retrieval (IR) Module, 
which searches a document collection to retrieve a 
set of relevant sentences using the question key 
words.  3. Answer Extraction (AE) Module, which 
analyzes the relevant sentences using the informa-
tion provided by the QP module and identify the 
answer phrase. 
In recent years, QA systems trend to be more 
and more complex, since many other NLP tech-
niques, such as named entity recognition, parsing, 
semantic analysis, reasoning, and external re-
sources, such as WordNet, web, databases, are in-
corporated.  The various techniques and resources 
may provide the indicative evidences to find the 
correct answers.  These evidences are further com-
bined by using a pipeline structure, a scoring func-
tion or a machine learning method. 
In the machine learning framework, it is critical 
but not trivial to generate the features from the 
various resources which may be represented as 
surface texts, syntactic structures and logic forms, 
etc.  The complexity of feature generation strongly 
depends on the complexity of data representation.  
Many previous QA systems (Echihabi et al., 2003; 
Ravichandran, et al., 2003; Ittycheriah and Roukos, 
2002; Ittycheriah, 2001; Xu et al., 2002) have well 
studied the features in the surface texts.  In this 
paper, we will use the answer extraction module of 
QA as a case study to further explore how to gen-
erate the features for the more complex sentence 
representations, such as parse tree.  Since parsing 
gives the deeper understanding of the sentence, the 
features generated from the parse tree are expected 
to improve the performance based on the features 
generated from the surface text.  The answer ex-
65
traction module is built using Support Vector Ma-
chines (SVM).  We propose three methods to rep-
resent the features in the parse tree: 1. features are 
designed by domain experts, extracted from the 
parse tree and represented as a feature vector; 2. 
the parse tree is transformed to a node sequence 
and a string kernel is employed; 3. the parse tree is 
retained as the original representation and a tree 
kernel is employed. 
Although many textual features have been used 
in the others’ AE modules, it is not clear that how 
much contribution the individual feature makes.  In 
this paper, we will discuss the effectiveness of 
each individual textual feature in detail.  We fur-
ther evaluate the effectiveness of the syntactic fea-
tures we proposed.  Our experiments using TREC 
questions show that the syntactic features improve 
the performance by 7.57 MRR based on the textual 
features.  It indicates that the new features based 
on a deeper language understanding are necessary 
to push further the machine learning-based QA 
technology.  Furthermore, the three representations 
of the syntactic features are compared.  We find 
that keeping the original data representation by 
using the data-specific kernel function in SVM 
may capture the more comprehensive evidences 
than the predefined features.  Although the features 
we generated are specific to the answer extraction 
task, the comparison between the different feature 
representations may be helpful to explore the syn-
tactic features for the other NLP applications. 
2 Related Work 
In the machine learning framework, it is crucial to 
capture the useful evidences for the task and inte-
grate them effectively in the model.  Many re-
searchers have explored the rich textual features 
for the answer extraction. 
IBM (Ittycheriah and Roukos, 2002; Ittycheriah, 
2001) used a Maximum Entropy model to integrate 
the rich features, including query expansion fea-
tures, focus matching features, answer candidate 
co-occurrence features, certain word frequency 
features, named entity features, dependency rela-
tion features, linguistic motivated features and sur-
face patterns.  ISI’s (Echihabi et al. 2003; Echihabi 
and Marcu, 2003) statistical-based AE module im-
plemented a noisy-channel model to explain how a 
given sentence tagged with an answer can be re-
written into a question through a sequence of sto-
chastic operations.  (Ravichandran et al., 2003) 
compared two maximum entropy-based QA sys-
tems, which view the AE as a classification prob-
lem and a re-ranking problem respectively, based 
on the word frequency features, expected answer 
class features, question word absent features and 
word match features.  BBN (Xu et al. 2002) used a 
HMM-based IR system to score the answer candi-
dates based on the answer contexts.  They further 
re-ranked the scored answer candidates using the 
constraint features, such as whether a numerical 
answer quantifies the correct noun, whether the 
answer is of the correct location sub-type and 
whether the answer satisfies the verb arguments of 
the questions.  (Suzuki et al. 2002) explored the 
answer extraction using SVM. 
However, in the previous statistical-based AE 
modules, most of the features were extracted from 
the surface texts which are mainly based on the 
key words/phrases matching and the key word fre-
quency statistics.  These features only capture the 
surface-based information for the proper answers 
and may not provide the deeper understanding of 
the sentences.  In addition, the contribution of the 
individual feature has not been evaluated by them.  
As for the features extracted from the structured 
texts, such as parse trees, only a few works ex-
plored some predefined syntactic relation features 
by partial matching.  In this paper, we will explore 
the syntactic features in the parse trees and com-
pare the different feature representations in SVM.  
Moreover, the contributions of the different fea-
tures will be discussed in detail. 
3 Answer Extraction 
Given a question Q and a set of relevant sentences 
SentSet which is returned by the IR module, we 
consider all of the base noun phrases and the words 
in the base noun phrases as answer candidates ac
i
.  
For example, for the question “Q1956: What coun-
try is the largest in square miles?”, we extract the 
answer candidates { Russia, largest country, larg-
est, country, world, Canada, No.2.} in the sentence 
“I recently read that Russia is the largest country 
in the world, with Canada No. 2.”  The goal of the 
AE module is to choose the most probable answer 
from a set of answer candidates 
12
{ , ,... }
m
ac ac ac  
for the question Q. 
We regard the answer extraction as a classifica-
tion problem, which classify each question and 
66
answer candidate pair <Q, ac
i
> into the positive 
class (the correct answer) and the negative class 
(the incorrect answer), based on some features.  
The predication for each <Q, ac
i
> is made inde-
pendently by the classifier, then, the ac with the 
most confident positive prediction is chosen as the 
answer for Q.  SVM have shown the excellent per-
formance for the binary classification, therefore, 
we employ it to classify the answer candidates.   
Answer extraction is not a trivial task, since it 
involves several components each of which is 
fairly sophisticated, including named entity recog-
nition, syntactic / semantic parsing, question analy-
sis, etc.  These components may provide some 
indicative evidences for the proper answers.  Be-
fore generating the features, we process the sen-
tences as follows: 
1. tag the answer sentences with named entities. 
2. parse the question and the answer sentences us-
ing the Collins’ parser (Collin, 1996). 
3. extract the key words from the questions, such 
as the target words, query words and verbs. 
In the following sections, we will briefly intro-
duce the machine learning algorithm.  Then, we 
will discuss the features in detail, including the 
motivations and representations of the features. 
4 Support Vector Machines  
Support Vector Machines (SVM) (Vapnik, 1995) 
have strong theoretical motivation in statistical 
learning theory and achieve excellent generaliza-
tion performance in many language processing 
applications, such as text classification (Joachims, 
1998). 
SVM constructs a binary classifier that predict 
whether an instance x (
n
∈wR) is positive 
( () 1f =x ) or negative ( ( ) 1f =−x ), where, an 
instance may be represented as a feature vector or 
a structure like sequence of characters or tree.  In 
the simplest case (linearly separable instances), the 
decision f( ) sgn( b )⋅+x= wx is made based 
on a separating hyperplane 0b⋅+=wx  (
n
∈wR, 
b∈R).  All instances lying on one side of the hy-
perplane are classified to a positive class, while 
others are classified to a negative class. 
Given a set of labeled training instances 
()()( ){ }
11 2 2
, , , ,..., ,
mm
D yy y= xx x , where 
n
i
∈xR 
and { }1, 1
i
y =−, SVM is to find the optimal hy-
perplane that separates the positive and negative 
training instances with a maximal margin.  The 
margin is defined as the distance from the separat-
ing hyperplane to the closest positive (negative) 
training instances.  SVM is trained by solving a 
dual quadratic programming problem. 
Practically, the instances are non-linearly sepa-
rable.  For this case, we need project the instances 
in the original space R
n
 to a higher dimensional 
space R
N
 based on the kernel function 
12 1 2
(, ) (),()K =<Φ Φ >xx x x ,where, ( ):
nN
Φ→xR R is 
a project function of the instance.  By this means, a 
linear separation will be made in the new space.  
Corresponding to the original space R
n
, a non-
linear separating surface is found.  The kernel 
function has to be defined based on the Mercer’s 
condition.  Generally, the following kernel func-
tions are widely used. 
Polynomial kernel: (, ) ( 1)
p
ij i j
k =⋅+xx xx  
Gaussian RBF kernel:  
2
2
2
(, )
ij
-
ij
ke
σ−
=
xx
xx  
5 Textual Features 
Since the features extracted from the surface texts 
have been well explored by many QA systems 
(Echihabi et al., 2003; Ravichandran, et al., 2003; 
Ittycheriah and Roukos, 2002; Ittycheriah, 2001; 
Xu et al., 2002), we will not focus on the textual 
feature generation in this paper.  Only four types of 
the basic features are used: 
1. Syntactic Tag Features: the features capture 
the syntactic/POS information of the words in 
the answer candidates.  For the certain ques-
tion, such as “Q1903: How many time zones 
are there in the world?”, if the answer candi-
date consists of the words with the syntactic 
tags “CD NN”, it is more likely to be the 
proper answer. 
2. Orthographic Features: the features capture 
the surface format of the answer candidates, 
such as capitalization, digits and lengths, etc.  
These features are motivated by the observa-
tions, such as, the length  of the answers are 
often less than 3 words for the factoid ques-
tions; the answers may not be the subse-
quences of the questions; the answers often 
contain digits for the certain questions. 
3. Named Entity Features: the features capture 
the named entity information of the answer 
67
candidates.  They are very effective for the 
who, when and where questions, such as, For 
“Q1950: Who created the literary character 
Phineas Fogg?“, the answer “Jules Verne” is 
tagged as a PERSON name in the sentences 
“Jules Verne 's Phileas Fogg made literary 
history when he traveled around the world in 
80 days in 1873.”.  For the certain question tar-
get, if the answer candidate is tagged as the 
certain type of named entity, one feature fires. 
4. Triggers: some trigger words are collected for 
the certain questions.  For examples, for 
“Q2156: How fast does Randy Johnson 
throw?”, the trigger word “mph” for the ques-
tion words “how fast” may help to identify the 
answer “98-mph” in “Johnson throws a 98-
mph fastball”. 
6 Syntactic Features 
In this section, we will discuss the feature genera-
tion in the parse trees.  Since parsing outputs the 
highly structured data representation of the sen-
tence, the features generated from the parse trees 
may provide the more linguistic-motivated expla-
nation for the proper answers.  However, it is not 
trivial to find the informative evidences from a 
parse tree. 
The motivation of the syntactic features in our 
task is that the proper answers often have the cer-
tain syntactic relations with the question key words.  
Table 1 shows some examples of the typical syn-
tactic relations between the proper answers (a) and 
the question target words (qtarget).  Furthermore, 
the syntactic relations between the answers and the 
different types of question key words vary a lot.  
Therefore, we capture the relation features for the 
different types of question words respectively.  The 
question words are divided into four types: 
 z Target word, which indicates the expected an-
swer type, such as “city” in “Q: What city is 
Disneyland in?”. 
 z Head word, which is extracted from how ques-
tions and indicates the expected answer head, 
such as “dog” in “Q210: How many dogs 
pull …?” 
 z Subject words, which are the base noun phrases 
of the question except the target word and the 
head word. 
 z Verb, which is the main verb of the question. 
To our knowledge, the syntactic relation fea-
tures between the answers and the question key 
words haven’t been explored in the previous ma-
chine learning-based QA systems.  Next, we will 
propose three methods to represent the syntactic 
relation features in SVM. 
6.1 Feature Vector 
It is the commonly used feature representation in 
most of the machine learning algorithms.  We pre-
define a set of syntactic relation features, which is 
an enumeration of some useful evidences of the 
answer candidates (ac) and the question key words 
in the parse trees.  20 syntactic features are manu-
ally designed in the task.  Some examples of the 
features are listed as follows, 
 z if the ac node is the same of the qtarget node, 
one feature fires. 
 z if the ac node is the sibling of the qtarget node, 
one feature fires. 
 z if the ac node the child of the qsubject node, 
one feature fires. 
The limitation of the manually designed features is 
that they only capture the evidences in the local 
context of the answer candidates and the question 
key words.  However, some question words, such 
as subject words, often have the long range syntac-
1. a node is the same as the qtarget node and qtarget is the hypernym of a. 
Q: What city is Disneyland in? 
S: Not bad for a struggling actor who was working at Tokyo Disneyland a few years ago. 
2. a node is the parent of qtarget node. 
Q: What is the name of the airport in Dallas Ft. Worth? 
S: Wednesday morning, the low temperature at the Dallas-Fort Worth International Airport was 81 degrees. 
3. a node is the sibling of the qtarget node. 
Q: What book did Rachel Carson write in 1962? 
S: In her 1962 book Silent Spring, Rachel Carson, a marine biologist, chronicled DDT 's poisonous effects, …. 
Table 1: Examples of the typical relations between answer and question target word.  In Q, the italic word is 
question target word.  In S, the italic word is the question target word which is mapped in the answer sentence; 
the underlined word is the proper answer for the question Q. 
68
Figure 1: An example of the path from the answer 
candidate node to the question subject word node 
tic relations with the answers.  To overcome the 
limitation, we will propose some special kernels 
which may keep the original data representation 
instead of explicitly enumerate the features, to ex-
plore a much larger feature space. 
6.2 String Kernel 
The second method represents the syntactic rela-
tion as a linked node sequence and incorporates a 
string kernel in SVM to handle the sequence. 
We extract a path from the node of the answer 
candidate to the node of the question key word in 
the parse tree.  The path is represented as a node 
sequence linked by symbols indicating upward or 
downward movement through the tree. For exam-
ple, in Figure 1, the path from the answer candi-
date node “211,456 miles” to the question subject 
word node “the moon” is 
“ NPB ADVP VP S NPB↑↑↑↓”, where “ ↑ ” and 
“ ↓ ” indicate upward movement and downward 
movement in the parse tree.  By this means, we 
represent the object from the original parse tree to 
the node sequence.  Each character of the sequence 
is a syntactic/POS tag of the node.  Next, a string 
kernel will be adapted to our task to calculate the 
similarity between two node sequences. 
 
 
 
 
 
(Haussler, 1999) first described a convolution 
kernel over the strings.  (Lodhi et al., 2000) applied 
the string kernel to the text classification.  (Leslie 
et al., 2002) further proposed a spectrum kernel, 
which is simpler and more efficient than the previ-
ous string kernels, for protein classification prob-
lem.  In their tasks, the string kernels achieved the 
better performance compared with the human-
defined features. 
The string kernel is to calculate the similarity 
between two strings.  It is based on the observation 
that the more common substrings the strings have, 
the more similar they are.  The string kernel we 
used is similar to (Leslie et al., 2002).  It is defined 
as the sum of the weighted common substrings.  
The substring is weighted by an exponentially de-
caying factor λ (set 0.5 in the experiment) of its 
length k.  For efficiency, we only consider the sub-
strings which length are less than 3.  Different 
from (Leslie et al., 2002), the characters (syntac-
tic/POS tag) of the string are linked with each 
other.  Therefore, the matching between two sub-
strings will consider the linking information.  Two 
identical substrings will not only have the same 
syntactic tag sequences but also have the same 
linking symbols.  For example, for the node se-
quences NP VP VP S NP↑↑↑↓  and NP NP VP NP↑↑↓ , 
there is a matched substring (k =  2 ) :  NP VP↑ . 
6.3 Tree Kernel 
The third method keeps the original representation 
of the syntactic relation in the parse tree and incor-
porates a tree kernel in SVM. 
Tree kernels are the structure-driven kernels to 
calculate the similarity between two trees.  They 
have been successfully accepted in the NLP appli-
cations.  (Collins and Duffy, 2002) defined a ker-
nel on parse tree and used it to improve parsing.  
(Collins, 2002) extended the approach to POS tag-
ging and named entity recognition.  (Zelenko et al., 
2003; Culotta and Sorensen, 2004) further ex-
plored tree kernels for relation extraction. 
We define an object (a relation tree) as the 
smallest tree which covers one answer candidate 
node and one question key word node.  Suppose 
that a relation tree T has nodes 
01
{ , ,..., }
n
tt t  and 
each node 
i
t is attached with a set of attrib-
utes
01
{ , ,..., }
m
aa a , which represents the local char-
acteristics of t
i
.  In our task, the set of the 
attributes includes Type attributes, Orthographic 
attributes and Relation Role attributes, as shown in 
Table 2.  The core idea of the tree kernel (, )
12
K TT  
is that the similarity between two trees T
1
 and T
2
 is 
PUNC
. away 221,456 miles 
S 
PP NPB VP 
VBZ ADVP
NPB RB 
the moon 
is 
Q1980: How far is the moon from Earth in miles? 
S: At its perigee, the closest approach to Earth , the 
moon is 221,456 miles away. 
…… 
69
T1_ac#target 
T2_ac#target 
Q1897: What is the name of the airport in Dallas Ft. Worth? 
S: Wednesday morning, the low temperature at the Dallas-Fort 
Worth International Airport was 81 degrees. 
t
4
t
3
 t
2
T: BNP 
O: null  
R1: true 
R2: false 
t
1
Dallas-Fort
T: NNP 
O: CAPALL 
R1: false 
R2: false
International 
T: JJ 
O: CAPALL  
R1: false 
R2: false 
Airport 
T: NNP 
O: CAPALL 
R1: false 
R2: true
Q35: What is the name of the highest mountain in Africa? 
S: Mount Kilimanjaro, at 19,342 feet, is Africa's highest moun-
tain, and is 5,000 feet higher than …. 
Mount 
T: NNP 
O: CAPALL 
R1: false 
R2: true
Kilimanjaro
T: NNP 
O: CAPALL 
R1: false 
R2: false 
T: BNP 
O: null  
R1: true 
R2: false 
t
0
 
w
0
 
w
1 w
2
 
Worth 
T: NNP 
O: CAPALL 
R1: false 
R2: false
the sum of the similarity between their subtrees.  It 
is calculated by dynamic programming and cap-
tures the long-range syntactic relations between 
two nodes.  The kernel we use is similar to (Ze-
lenko et al., 2003) except that we define a task-
specific matching function and similarity function, 
which are two primitive functions to calculate the 
similarity between two nodes in terms of their at-
tributes. 
Matching function 
1 if . .  and . .   
(, )
0 otherwise                                           
ij ij
ij
t type t type t role t role
mt t
==
=



 
Similarity function 
0
{ ,..., }
(, ) (., .)
ij i j
m
aa a
s tt ftata
∈
=
∑
 
where, (., .)
ij
ftata is a compatibility function be-
tween two feature values 
..
(., .)
1   if 
0   otherwise
ij
ij
ta ta
ftata=
=



 
Figure 2 shows two examples of the relation tree 
T1_ac#targetword and T2_ac#targetword.  The 
kernel we used matches the following pairs of the 
nodes <t
0
, w
0
>, <t
1
, w
2
>, <t
2
, w
2
> and <t
4
, w
1
>. 
 
Attributes Examples 
POS tag CD, NNP, NN…Type 
syntactic tag NP, VP, … 
Is Digit? DIG, DIGALL 
Is Capitalized? CAP, CAPALL 
Ortho-
graphic  
length of phrase LNG1, LNG2#3, 
LNGgt3 
Role1 Is answer candidate? true, false 
Role2 Is question key words? true, false 
Table 2: Attributes of the nodes 
7 Experiments 
We apply the AE module to the TREC QA task.  
To evaluate the features in the AE module inde-
pendently, we suppose that the IR module has got 
100% precision and only passes those sentences 
containing the proper answers to the AE module.  
The AE module is to identify the proper answers 
from the given sentence collection. 
We use the questions of TREC8, 9, 2001 and 
2002 for training and the questions of TREC2003 
for testing.  The following steps are used to gener-
ate the data: 
1. Retrieve the relevant documents for each ques-
tion based on the TREC judgments. 
2. Extract the sentences, which match both the 
proper answer and at least one question key word, 
from these documents. 
3. Tag the proper answer in the sentences based on 
the TREC answer patterns 
 
Figure 2: Two objects representing the relations be-
tween answer candidates and target words. 
 
In TREC 2003, there are 413 factoid questions 
in which 51 questions (NIL questions) are not re-
turned with the proper answers by TREC.  Accord-
ing to our data generation process, we cannot 
provide data for those NIL questions because we 
cannot get the sentence collections.  Therefore, the 
AE module will fail on all of the NIL questions 
and the number of the valid questions should be 
362 (413 – 51).  In the experiment, we still test the 
module on the whole question set (413 questions) 
to keep consistent with the other’s work.  The 
training set contains 1252 questions.  The perform-
ance of our system is evaluated using the mean 
reciprocal rank (MRR).  Furthermore, we also list 
the percentages of the correct answers respectively 
70
in terms of the top 5 answers and the top 1 answer 
returned.  We employ the SVM
Light 
(Joachims, 
1999) to incorporate the features and classify the 
answer candidates.  No post-processes are used to 
adjust the answers in the experiments. 
Firstly, we evaluate the effectiveness of the tex-
tual features, described in Section 5.  We incorpo-
rate them into SVM using the three kernel 
functions: linear kernel, polynomial kernel and 
RBF kernel, which are introduced in Section 4.  
Table 3 shows the performance for the different 
kernels.  The RBF kernel (46.24 MRR) signifi-
cantly outperforms the linear kernel (33.72 MRR) 
and the polynomial kernel (40.45 MRR).  There-
fore, we will use the RBF kernel in the rest ex-
periments. 
 Top1 Top5 MRR 
linear 31.28 37.91 33.72 
polynomial 37.91 44.55 40.45 
RBF 42.67 51.58 46.24 
Table 3: Performance for kernels 
 
In order to evaluate the contribution of the indi-
vidual feature, we test out module using different 
feature combinations, as shown in Table 4.  Sev-
eral findings are concluded: 
1. With only the syntactic tag features F
syn.
, the 
module achieves a basic level MRR of 31.38.  The 
questions “Q1903: How many time zones are there 
in the world?“ is correctly answered from the sen-
tence “The world is divided into 24 time zones.”.  
2. The orthographic features F
orth.
 show the posi-
tive effect with 7.12 MRR improvement based on 
F
syn.
.  They help to find the proper answer “Grover 
Cleveland” for the question “Q2049: What presi-
dent served 2 nonconsecutive terms?” from the 
sentence “Grover Cleveland is the forgotten two-
term American president.”, while F
syn.
 wrongly 
identify “president” as the answer. 
3. The named entity features F
ne 
are also benefi-
cial as they make the 4.46 MRR increase based on 
F
syn.
+F
orth. 
  For the question “Q2076: What com-
pany owns the soft drink brand "Gatorade"?”, F
ne
 
find the proper answer “Quaker Oats” in the sen-
tence “Marineau , 53 , had distinguished himself 
by turning the sports drink Gatorade into a mass 
consumer brand while an executive at Quaker Oats 
During his 18-month…”, while F
syn.
+F
orth.
 return 
the wrong answer “Marineau”. 
4. The trigger features F
trg
 lead to an improve-
ment of 3.28 MRR based on F
syn.
+F
orth
+F
ne
.  They 
correctly answer more questions.  For the question 
“Q1937: How fast can a nuclear submarine 
travel?”, F
trg
 return the proper answer “25 knots” 
from the sentence “The submarine , 360 feet 
( 109.8 meters ) long , has 129 crew members and 
travels at 25 knots.”, but the previous features fail 
on it. 
F
syn
F
orth. 
F
ne
F
trg 
Top1 Top5 MRR
√    26.50 38.92 31.38
√ √   34.69 43.61 38.50
√ √ √  39.85 47.82 42.96
√ √ √ √ 42.67 51.58 46.24
Table 4: Performance for feature combinations 
 
Next, we will evaluate the effectiveness of the syn-
tactic features, described in Section 6.  Table 5 
compares the three feature representation methods, 
FeatureVector, StringKernel and TreeKernel.   
 z FeatureVector (Section 6.1).  We predefine 
some features in the syntactic tree and present 
them as a feature vector.  The syntactic fea-
tures are added with the textual features and 
the RBF kernel is used to cope with them. 
 z StringKernel (Section 6.2).  No features are 
predefined.  We transform the syntactic rela-
tions between answer candidates and question 
key words to node sequences and a string ker-
nel is proposed to cope with the sequences.  
Then we add the string kernel for the syntactic 
relations and the RBF kernel for the textual 
features. 
 z TreeKernel (Section 6.3).  No features are 
predefined.  We keep the original representa-
tions of the syntactic relations and propose a 
tree kernel to cope with the relation trees.  
Then we add the tree kernel and the RBF ker-
nel. 
 
Top1 Top2 MRR
F
syn.
+F
orth.
+F
ne
+F
trg
 42.67 51.58 46.24
FeatureVector 46.19 53.69 49.28
StringKernel 48.99 55.83 52.29
TreeKernel 50.41 57.46 53.81
Table 5: Performance for syntactic feature repre-
sentations 
 
Table 5 shows the performances of FeatureVec-
tor, StringKernel and TreeKernel.  All of them im-
prove the performance based on the textual 
features (F
syn.
+F
orth.
+F
ne
+F
trg
) by 3.04 MRR, 6.05 
MRR and 7.57 MRR respectively.  The probable 
reason may be that the features generated from the 
structured data representation may capture the 
71
more linguistic-motivated evidences for the proper 
answers.  For example, the syntactic features help 
to find the answer “nitrogen” for the question 
“Q2139: What gas is 78 percent of the earth 's at-
mosphere?” in the sentence “One thing they have-
n't found in the moon's atmosphere so far is 
nitrogen, the gas that makes up more than three-
quarters of the Earth's atmosphere.”, while the 
textual features fail on it.  Furthermore, the String-
Kernel (+3.01MRR) and TreeKernel (+4.53MRR) 
achieve the higher performance than FeatureVec-
tor, which may be explained that keeping the 
original data representations by incorporating the 
data-specific kernels in SVM may capture the 
more comprehensive evidences than the predefined 
features.  Moreover, TreeKernel slightly outper-
forms StringKernel by 1.52 MRR.  The reason may 
be that when we transform the representation of the 
syntactic relation from the tree to the node se-
quence, some information may be lost, such as the 
sibling node of the answer candidates.  Sometimes 
the information is useful to find the proper answers. 
8 Conclusion  
In this paper, we study the feature generation based 
on the various data representations, such as surface 
text and parse tree, for the answer extraction.  We 
generate the syntactic tag features, orthographic 
features, named entity features and trigger features 
from the surface texts.  We further explore the fea-
ture generation from the parse trees which provide 
the more linguistic-motivated evidences for the 
task.  We propose three methods, including feature 
vector, string kernel and tree kernel, to represent 
the syntactic features in Support Vector Machines.  
The experiment on the TREC question answering 
task shows that the syntactic features significantly 
improve the performance by 7.57MRR based on 
the textual features.  Furthermore, keeping the 
original data representation using a data-specific 
kernel achieves the better performance than the 
explicitly enumerated features in SVM. 

References  
M. Collins.  1996.  A New Statistical Parser Based on 
Bigram Lexical Dependencies.  In Proceedings of 
ACL-96, pages 184-191. 
M. Collins. 2002.  New Ranking Algorithms for Parsing 
and Tagging: Kernel over Discrete Structures, and 
the Voted Perceptron.  In Proceedings of ACL-2002. 
M. Collins and N. Duffy.  2002.  Convolution Kernels 
for Natural Language.  Advances in Neural Informa-
tion Processing Systems 14, Cambridge, MA.  MIT 
Press. 
A. Culotta and J. Sorensen.  2004.  Dependency Tree 
Kernels for Relation Extraction.  In Proceedings of 
ACL-2004. 
A. Echihabi, U. Hermjakob, E. Hovy, D. Marcu, E. 
Melz, D. Ravichandran.  2003.  Multiple-Engine 
Question Answering in TextMap.  In Proceedings of 
the TREC-2003 Conference, NIST. 
A. Echihabi, D. Marcu. 2003. A Noisy-Channel Ap-
proach to Question Answering. In Proceedings of the 
ACL-2003. 
D. Haussler. 1999.  Convolution Kernels on Discrete 
Structures.  Technical Report UCS-CRL-99-10, Uni-
versity of California, Santa Cruz. 
A. Ittycheriah and S. Roukos.  2002.  IBM’s Statistical 
Question Answering System – TREC 11.  In Pro-
ceedings of the TREC-2002 Conference, NIST. 
A. Ittycheriah. 2001. Trainable Question Answering 
System.  Ph.D. Dissertation, Rutgers, The State Uni-
versity of New Jersey, New Brunswick, NJ. 
T. Joachims.  1999.  Making large-Scale SVM Learn-
ing Practical.  Advances in Kernel Methods - Sup-
port Vector Learning, MIT-Press, 1999. 
T. Joachims.  1998.  Text Categorization with Support 
Vector Machines: Learning with Many Relevant Fea-
tures.  In Proceedings of the European Conference on 
Machine Learning, Springer. 
C. Leslie, E. Eskin and W. S. Noble.  2002.  The spec-
trum kernel: A string kernel for SVM protein classi-
fication.  Proceedings of the Pacific Biocomputing 
Symposium. 
H. Lodhi, J. S. Taylor, N. Cristianini and C. J. C. H. 
Watkins.  2000.  Text Classification using String 
Kernels.  In NIPS, pages 563-569. 
D. Ravichandran, E. Hovy and F. J. Och.  2003.  Statis-
tical QA – Classifier vs. Re-ranker: What’s the dif-
ference?  In Proceedings of Workshop on Mulingual 
Summarization and Question Answering, ACL 2003. 
J. Suzuki, Y. Sasaki, and E. Maeda. 2002. SVM Answer 
Selection for Open-domain Question Answering. In 
Proc. of COLING 2002, pages 974–980. 
V. N. Vapnik.  1998.  Statistical Learning Theory.  
Springer. 
E.M. Voorhees.  2003. Overview of the TREC 2003 
Question Answering Track.  In Proceedings of the 
TREC-2003 Conference, NIST. 
J. Xu, A. Licuanan, J. May, S. Miller and R. Weischedel.  
2002.  TREC 2002 QA at BBN: Answer Selection 
and Confidence Estimation.  In Proceedings of the 
TREC-2002 Conference, NIST. 
D. Zelenko, C. Aone and A. Richardella.  2003.  Kernel 
Methods for Relation Extraction.  Journal of Ma-
chine Learning Research, pages 1083-1106. 
