Question Classification using HDAG Kernel
Jun Suzuki, Hirotoshi Taira, Yutaka Sasaki, and Eisaku Maeda
NTT Communication Science Laboratories, NTT Corp.
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237 Japan
{jun, taira, sasaki, maeda}@cslab.kecl.ntt.co.jp
Abstract
This paper proposes a machine learning
based question classification method us-
ing a kernel function, Hierarchical Di-
rected Acyclic Graph (HDAG) Kernel.
The HDAG Kernel directly accepts struc-
tured natural language data, such as sev-
eral levels of chunks and their relations,
and computes the value of the kernel func-
tion at a practical cost and time while re-
flecting all of these structures. We ex-
amine the proposed method in a ques-
tion classification experiment using 5011
Japanese questions that are labeled by
150 question types. The results demon-
strate that our proposed method improves
the performance of question classification
over that by conventional methods such as
bag-of-words and their combinations.
1 Introduction
Open-domain Question Answering (ODQA) in-
volves the extraction of correct answer(s) to a given
free-form factual question from a large collection
of texts. ODQA has been actively studied all over
the world since the start of the Question Answering
Track at TREC-8 in 1999.
The definition of ODQA tasks at the TREC QA-
Track has been revised and extended year after
year. At first, ODQA followed the Passage Retrieval
method as used at TREC-8. That is, the ODQA task
was to answer a question in the form of strings of
50 bytes or 250 bytes excerpted from a large set of
news wires. Recently, however, the ODQA task is
considered to be a task of extracting exact answers
to a question. For instance, if a QA system is given
the question “When was Queen Victoria born?”, it
should answer “1819”.
Typically, QA systems have the following compo-
nents for achieving ODQA:
Question analysis analyzes a given question and
determines the question type and keywords.
Text retrieval finds the top $N$ paragraphs or documents
that match the result of the question analysis component.
Answer candidate extraction extracts answer can-
didates of the given question from the docu-
ments retrieved by the text retrieval component,
based on the results of the question types.
Answer selection selects the most plausible an-
swer(s) to the given question from among the
answer candidates extracted by the answer can-
didate extraction component.
One of the most important processes of those
listed above is identifying the target of intention in a
given question to determine the type of sought-after
answer. This process of determining the question
type for a given question is usually called question
classification. Without a question type, that is, the
result of question classification, it would be much
more difficult or even nearly infeasible to select cor-
rect answers from among the possible answer can-
didates, which would necessarily be all of the noun
phrases or named entities in the texts. Question classification
thus provides a powerful restriction that reduces the answer
candidates to be evaluated in the answer selection process
to a practical number.
This work develops a machine learning approach
to question classification (Harabagiu et al., 2000;
Hermjakob, 2001; Li and Roth, 2002). We use the
Hierarchical Directed Acyclic Graph (HDAG) Ker-
nel (Suzuki et al., 2003), which is suited to handle
structured natural language data. It can handle struc-
tures within texts as the features of texts without
converting the structures to the explicit representa-
tion of numerical feature vectors. This framework is
useful for question classification because the works
of (Li and Roth, 2002; Suzuki et al., 2002a) showed
that richer information, such as structural and se-
mantical information inside a given question, im-
proves the question classification performance over
using the information of just simple key terms.
In Section 2, we present the question classifica-
tion problem. In Section 3, we explain our proposed
method for question classification. Finally, in Sec-
tion 4, we describe our experiment and results.
2 Question Classification
Question classification is defined as a task that maps
a given question to one or more of $k$ question
types (classes).
In the general concept of QA systems, the result
of question classification is used in a downstream
process, answer selection, to select a correct answer
from among the large number of answer candidates
that are extracted from the source documents. The
result of the question classification, that is, the la-
bels of the question types, can reduce the number
of answer candidates. Therefore, we no longer have
to evaluate every noun phrase in the source docu-
ments to see whether it provides a correct answer to
a given question. Evaluating only answer candidates
that match the results of question classification is an
efficient method of obtaining correct answers. Thus,
question classification is an important process of a
QA system. Better performance in question classi-
fication will lead to better total performance of the
QA system.
2.1 Question Types: Classes of Questions
Numerous question taxonomies have been defined,
but unfortunately, no standard exists.
In the case of the TREC QA-Track, most systems
have their own question taxonomy, and these are re-
constructed year by year. For example, (Ittycheriah
et al., 2001) defined 31 original question types in
two levels of hierarchical structure. (Harabagiu et
al., 2000) also defined a large hierarchical question
taxonomy, and (Hovy et al., 2001) defined 141 ques-
tion types of a hierarchical question taxonomy.
Within all of these taxonomies, question types are
defined from the viewpoint of the target intention of
the given questions, and they have hierarchical structures,
even though these question taxonomies are defined
by different researchers. This is because the purpose
of question classification is to reduce the large
number of answer candidates by restricting the tar-
get intention via question types. Moreover, it is very
useful to handle question taxonomy constructed in a
hierarchical structure in the downstream processes.
Thus, question types should be the target intention
and constructed in a hierarchical structure.
2.2 Properties
Question classification is quite similar to Text Cate-
gorization, which is one of the major tasks in Nat-
ural Language Processing (NLP). These tasks re-
quire classification of the given text to certain de-
fined classes. In general, in the case of text catego-
rization, the given text is one document, such as a
newspaper article, and the classes are the topics of
the articles. In the case of question classification,
a given text is one short question sentence, and the
classes are the target answers corresponding to the
intention of the given question.
However, question classification requires much
more complicated features than text categorization,
as shown by (Li and Roth, 2002). They proved that
question classification needs richer information than
simple key terms (bag-of-words), which usually give
us high performance in text classification. More-
over, the previous work of (Suzuki et al., 2002a)
showed that the sequential patterns constructed by
different levels of attributes, such as words, part-of-
speech (POS) and semantical information, improve
the performance of question classification. The ex-
periments in these previous works indicated that
the structural and semantical features inside ques-
tions have the potential to improve the performance
of question classification. In other words, high-
performance question classification requires us to
extract the structural and semantical features from
the given question.
2.3 Learning and Classification Task
This paper focuses on the machine learning ap-
proach to question classification. The machine
learning approach has several advantages over man-
ual methods.
First, the construction of a manual classifier for
questions is a tedious task that requires the analy-
sis of a large number of questions. Moreover, map-
ping questions into question types requires the use
of lexical items and, therefore, an explicit represen-
tation of the mapping may be very large. On the
other hand, machine learning approaches only need
to define features. Finally, the classifier can be more
flexibly reconstructed than a manual one because it
can be trained on a new taxonomy in a very short
time.
As the machine learning algorithm, we chose
Support Vector Machines (SVMs) (Cortes and Vapnik, 1995),
because (Joachims, 1998; Taira and Haruno, 1999) reported
state-of-the-art performance for SVMs in text categorization,
and question classification is a similar process to text
categorization.
3 HDAG Kernel
Recently, the design of kernel functions has become
a hot topic in the research field of machine learning.
A specific kernel can drastically increase the perfor-
mance of specific tasks. Moreover, a specific kernel
can handle new feature spaces that are difficult to
manage directly with conventional methods.
The HDAG Kernel is a new kernel function that
is designed to easily handle structured natural lan-
guage data. According to the discussion in the pre-
vious section, richer information such as structural
and semantical information is required for high-
performance question classification.
We think that the HDAG Kernel is suitable for
improving the performance of question classifica-
tion: The HDAG Kernel can handle various linguis-
tic structures within texts, such as chunks and their
relations, as the features of the text without convert-
ing such structures to numerical feature vectors ex-
plicitly.
3.1 Feature Space
Figure 1 shows examples of the structures within
questions that are handled by the HDAG kernel.
As shown in Figure 1, the HDAG kernel accepts
several levels of chunks and their relations inside the
text. The nodes represent several levels of chunks
including words, and directed links represent their
relations. Let $p_i \in P$ and $q_i \in Q$ denote the nodes
of the two graphs. Nodes that contain a graph inside
themselves are called “non-terminal nodes”.
Each node can have more than one attribute, such
as words, part-of-speech tags, semantic information
like WordNet (Fellbaum, 1998), and class names of
the named entity. Moreover, nodes are allowed to
not have any attribute, in other words, we do not
have to assign attributes to all nodes.
The “attribute sequence” is a sequence of attributes
extracted from the nodes in sub-paths of HDAGs.
One type of attribute sequence becomes one element
in the feature vector. The framework of the HDAG
Kernel allows node skips during the extraction of
attribute sequences, and the cost of a skip is based on
the decay factor $\lambda$ ($0 < \lambda \le 1$), since the HDAG
Kernel deals with not only exact matching of the
sub-structures between HDAGs but also approximate
structure matching.
The explicit representation of feature vectors in
the HDAG kernel can be written as
$\phi(G) = (\phi_1(G), \ldots, \phi_N(G))$, where $\phi$ represents
the explicit feature mapping from the HDAG to the feature
vector and $N$ represents the number of all possible
types of attribute sequences extracted from the HDAGs.
The value of $\phi_i(G)$ is the number of occurrences of
the $i$'th attribute sequence in the HDAG $G$, weighted
according to the node skips.
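As an illustration of this feature mapping, the following minimal sketch enumerates weighted attribute sequences for a flat chain of nodes. It is a simplification under stated assumptions, not the HDAG feature space itself: the nodes form a single chain with no non-terminal nodes, the function name `attribute_sequences` and the `max_len` limit are hypothetical, and each skipped node multiplies the weight by the decay factor once.

```python
from collections import defaultdict
from itertools import combinations, product


def attribute_sequences(nodes, decay=0.5, max_len=3):
    """Map a chain of nodes to {attribute sequence: weighted count}.

    nodes   -- list of attribute sets, one set per node, in chain order
    decay   -- factor multiplied in once per skipped node (assumption)
    max_len -- longest sub-path, in nodes, that is enumerated (assumption)
    """
    features = defaultdict(float)
    for length in range(1, max_len + 1):
        for idx in combinations(range(len(nodes)), length):
            # nodes skipped inside the span covered by this sub-path
            skips = (idx[-1] - idx[0] + 1) - length
            weight = decay ** skips
            # choose one attribute from every visited node
            for seq in product(*(sorted(nodes[i]) for i in idx)):
                features[seq] += weight
    return dict(features)


# Two word nodes, each carrying a word and a part-of-speech attribute.
phi = attribute_sequences([{"George", "NNP"}, {"Bush", "NNP"}])
# Three single-attribute nodes; skipping the middle one costs one decay factor.
phi_skip = attribute_sequences([{"purchased"}, {"George"}, {"Bush"}], decay=0.5)
```

In this sketch the sequence `("NNP",)` receives the value 2 because it occurs in both nodes, and `("purchased", "Bush")` receives the value 0.5 because one node is skipped.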
Table 1 shows an example of attribute sequences
that are extracted from the example question in
Figure 1. The symbol * in the sub-path column shows
that more than one node skip occurred there. The
parentheses “( )” in the attribute sequence column
represent the boundaries of a node. For example, the
attribute sequence “purchased-(NNP-Bush)” is extracted
from sub-path “p4-p1”, and “NNP-Bush” is in the node p1.

Figure 1: Example of text structure handled by HDAG (From the questions in TREC-10)
[Figure omitted. It shows two questions as graphs. “How far is it from Denver to Aspen?” uses nodes q1 to q10, with POS tags WRB RB VBZ PRP IN NNP TO NNP, chunk labels WHADVP and ADVP, and two LOCATION entities. “George Bush purchased a small interest in which baseball team?” uses nodes p1 to p14, with POS tags NNP NNP VBD DT JJ NN IN WDT NN NN, chunk labels NP and PP, and one PERSON entity.]

Table 1: Examples of attribute sequences, elements
of feature vectors, extracted from the example question
in Figure 1

sub-path      attribute sequence: element                    value
p1            PERSON                                         1
p2            George                                         1
p2            NNP                                            1
p2-p3         George-Bush                                    1
p2-p3         NNP-Bush                                       1
p2-p3         George-NNP                                     1
p2-p3         NNP-NNP                                        1
...           ...                                            ...
p4-p1         purchased-(NNP-Bush)                           1
p4-p1         purchased-(PERSON)                             1
p4-p1         purchased-(Bush)                               $\lambda$
...           ...                                            ...
p4-*-p10      purchased-(NP)                                 $\lambda^4$
...           ...                                            ...
p4-p5-*-p10   VBD-(a-small-interest)-(which-baseball-team)   $\lambda^2$
p4-p5-*-p10   purchased-(NP)-(which-team)                    $\lambda^3$
The return value of the HDAG Kernel can be defined as:

$K(X, Y) = \langle \phi(X) \cdot \phi(Y) \rangle = \sum_{i=1}^{N} \phi_i(X) \cdot \phi_i(Y)$,   (1)

where input objects $X$ and $Y$ are the objects
represented in HDAGs $G_1$ and $G_2$, respectively.
According to this formula, the HDAG Kernel calculates
the inner product of the common attribute sequences
weighted according to their node skips and their
occurrences between the two HDAGs, $G_1$ and $G_2$.
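Equation (1) can be illustrated with explicit sparse feature vectors. The sketch below assumes that each vector is stored as a dictionary from attribute sequences to weighted counts; the function name `kernel_from_features` and the example values are hypothetical and are not taken from the experiments.

```python
def kernel_from_features(phi_x, phi_y):
    """Inner product of two sparse feature vectors stored as dicts."""
    # iterate over the smaller dict; absent sequences contribute zero
    if len(phi_x) > len(phi_y):
        phi_x, phi_y = phi_y, phi_x
    return sum(v * phi_y[k] for k, v in phi_x.items() if k in phi_y)


# Hypothetical weighted counts for two objects X and Y.
phi_x = {("purchased", "PERSON"): 1.0, ("purchased", "Bush"): 0.5}
phi_y = {("purchased", "PERSON"): 1.0, ("purchased", "Bush"): 0.5, ("sold",): 1.0}

value = kernel_from_features(phi_x, phi_y)  # 1.0 * 1.0 + 0.5 * 0.5
```

Only attribute sequences that occur in both objects contribute to the sum, which is why the kernel can be described as an inner product over the common attribute sequences.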
3.2 Efficient Calculation Method
In general, the dimension $N$ of the feature space in
equation (1) becomes very high or even infinite. It
might thus be computationally infeasible to generate
the feature vector $\phi(G)$ explicitly.
To solve this problem, we focus on the framework
of the kernel functions defined for a discrete struc-
ture, Convolution Kernels (Haussler, 1999). One
of the most remarkable properties of this kernel
methodology is that it can calculate kernel functions
by the “inner products between pairs of objects”
while it retains the original representation of objects.
This means that we do not have to map input objects
to the numerical feature vectors by explicitly repre-
senting them, as long as an efficient calculation for
the inner products between a pair of texts is defined.
However, Convolution Kernels are an abstract framework.
The Tree Kernel (Collins and Duffy, 2001)
and String Subsequence Kernel (SSK) (Lodhi et al.,
2002) are instances of Convolution Kernels
developed in the NLP field.
The HDAG Kernel also uses this framework: we
can learn and classify without creating the explicit
numerical feature vectors that appear in equation (1).
The efficient calculation of inner products between HDAGs,
the return value of the HDAG Kernel, was defined in a
recursive formulation (Suzuki et al., 2003). This
recursive formulation for the HDAG Kernel can be rewritten
as “for loops” by using the dynamic programming
technique. Finally, the HDAG Kernel can be
calculated in $O(|P||Q|)$ time.
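To illustrate how such a recursion becomes “for loops”, the sketch below uses a much simpler relative of the HDAG Kernel: a gap-weighted common-subsequence kernel on two flat token sequences. This is a swapped-in technique and not the recursion of (Suzuki et al., 2003); it ignores hierarchy and multiple attributes per node. The function name `gap_weighted_kernel` and the table names are hypothetical. What it shares with the text above is that the feature vectors are never built and that the two nested loops visit each pair of positions once.

```python
def gap_weighted_kernel(s, t, decay=0.5):
    """Sum, over all pairs of matching subsequences of s and t, of
    decay ** (skipped positions in s + skipped positions in t)."""
    m, n = len(s), len(t)
    # ends[i][j]: weight of matching subsequence pairs ending at s[i-1], t[j-1]
    # acc[i][j]:  decayed sum of ends over the prefix rectangle (i, j)
    ends = [[0.0] * (n + 1) for _ in range(m + 1)]
    acc = [[0.0] * (n + 1) for _ in range(m + 1)]
    total = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                # start a new pair here, or extend any earlier pair
                ends[i][j] = 1.0 + acc[i - 1][j - 1]
                total += ends[i][j]
            # inclusion-exclusion update of the decayed prefix sum
            acc[i][j] = (ends[i][j]
                         + decay * acc[i - 1][j]
                         + decay * acc[i][j - 1]
                         - decay * decay * acc[i - 1][j - 1])
    return total
```

For example, with `decay=0.5` the sequences `"abc"` and `"ac"` share the subsequences `a`, `c` and `ac`; the last one skips one position in the first sequence, so the kernel value is 1 + 1 + 0.5.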
Table 2: Distribution of 5011 questions over question type hierarchy
d question type #
0 !TOP 4963
1 NAME 3190
2 PERSON 824
3 *LASTNAME 0
3 *MALE FIRSTNAME 1
3 *FEMALE FIRSTNAME 2
2 ORGANIZATION 733
3 COMPANY 119
3 *COMPANY GROUP 0
3 *MILITARY 4
3 INSTITUTE 26
3 *MARKET 0
3 POLITICAL ORGANIZATION 103
4 GOVERNMENT 38
4 POLITICAL PARTY 43
4 PUBLIC INSTITUTION 19
3 GROUP 96
4 !SPORTS TEAM 20
3 *ETHNIC GROUP 4
3 *NATIONALITY 4
2 LOCATION 752
3 GPE 265
4 CITY 77
4 *COUNTY 1
4 PROVINCE 47
4 COUNTRY 116
3 REGION 23
3 GEOLOGICAL REGION 22
4 *LANDFORM 9
4 *WATER FORM 7
4 *SEA 3
3 *ASTRAL BODY 5
4 *STAR 2
4 *PLANET 2
3 ADDRESS 59
4 POSTAL ADDRESS 24
4 PHONE NUMBER 22
4 *EMAIL 4
4 *URL 8
2 FACILITY 147
3 GOE 99
4 SCHOOL 27
4 *MUSEUM 3
4 *AMUSEMENT PARK 4
4 WORSHIP PLACE 10
4 STATION TOP 12
5 *AIRPORT 6
5 *STATION 3
5 *PORT 3
5 *CAR STOP 0
3 LINE 24
4 *RAILROAD 3
4 !ROAD 11
4 *WATERWAY 0
4 *TUNNEL 1
4 *BRIDGE 1
3 *PARK 2
3 *MONUMENT 3
2 PRODUCT 468
3 VEHICLE 37
4 *CAR 8
4 *TRAIN 2
4 *AIRCRAFT 5
4 *SPACESHIP 8
4 !SHIP 12
3 DRUG 15
3 *WEAPON 4
3 *STOCK 0
3 *CURRENCY 8
3 AWARD 11
3 *THEORY 1
3 RULE 66
3 *SERVICE 2
3 *CHARACTER 4
3 METHOD SYSTEM 33
3 ACTION MOVEMENT 21
3 *PLAN 1
3 *ACADEMIC 5
3 *CATEGORY 0
3 SPORTS 11
3 OFFENCE 10
3 ART 78
4 *PICTURE 2
4 *BROADCAST PROGRAM 6
4 MOVIE 15
4 *SHOW 4
4 MUSIC 13
3 PRINTING 31
4 !BOOK 10
4 *NEWSPAPER 7
4 *MAGAZINE 4
2 DISEASE 44
2 EVENT 99
3 *GAMES 8
3 !CONFERENCE 17
3 *PHENOMENA 6
3 *WAR 3
3 *NATURAL DISASTER 5
3 *CRIME 6
2 TITLE 97
3 !POSITION TITLE 97
2 *LANGUAGE 8
2 *RELIGION 6
1 NATURAL OBJECT 96
2 ANIMAL 18
2 VEGETABLE 15
2 MINERAL 54
1 COLOR 10
1 TIME TOP 779
2 TIMEX 652
3 TIME 50
3 DATE 594
3 *ERA 5
2 PERIODX 125
3 *TIME PERIOD 9
3 *DATE PERIOD 9
3 *WEEK PERIOD 4
3 *MONTH PERIOD 6
3 !YEAR PERIOD 41
1 NUMEX 882
2 MONEY 187
2 *STOCK INDEX 0
2 *POINT 9
2 PERCENT 94
2 MULTIPLICATION 10
2 FREQUENCY 27
2 *RANK 8
2 AGE 58
2 MEASUREMENT 133
3 PHYSICAL EXTENT 53
3 SPACE 18
3 VOLUME 14
3 WEIGHT 22
3 *SPEED 9
3 *INTENSITY 0
3 *TEMPERATURE 7
3 *CALORIE 1
3 *SEISMIC INTENSITY 2
2 COUNTX 326
3 N PERSON 162
3 N ORGANIZATION 49
3 N LOCATION 27
4 *N COUNTRY 9
3 *N FACILITY 6
3 N PRODUCT 47
3 *N EVENT 8
3 *N ANIMAL 7
3 *N VEGETABLE 0
3 *N MINERAL 0
0 *OTHER 48
4 Experiment
4.1 Data Set
We used three different QA data sets together to
evaluate the performance of our proposed method.
The first is the 1011 questions of NTCIR-QAC1 [1], which
were gathered from the 'dry-run', 'formal-run' and
'additional-run'. The second is the 2000 questions
described in (Suzuki et al., 2002b). The last is
the 2000 questions of the CRL-QA data [2]. All three
QA data sets are written in Japanese.
These data were labeled with the 150 question
types that are defined in the CRL-QA data, along
with one additional question type, “OTHER”. Table 2
shows all of the question types we used in this
experiment, where $d$ represents the depth of the
hierarchy and # represents the number of questions of
each question type, including the number of questions
in “child question types”.
[1] http://www.nlp.cs.ritsumei.ac.jp/qac/
[2] http://www.cs.nyu.edu/~sekine/PROJECT/CRLQA/
While considering question classification as a
learning and classification problem, we decided not
to use question types that do not have enough questions
(fewer than ten questions), indicated by an asterisk (*)
in front of the name of the question type,
because classifier learning is very difficult with very
few data. In addition, after the above operation, if
only one question type remained under a parent question
type, we also deleted it, which is indicated by
an exclamation mark (!). Ultimately, we evaluated
68 question types.
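The two pruning rules can be sketched as follows. This is a minimal illustration, assuming the taxonomy is given as rows of (depth, name, count) in the order of Table 2; the function name `prune_taxonomy` is hypothetical, the threshold of ten is read off the counts in Table 2, and leaving the top-most rows of the input untouched by the second rule is a simplifying assumption of this sketch.

```python
MIN_QUESTIONS = 10  # assumption read off Table 2: kept types have >= 10 questions


def prune_taxonomy(rows, min_questions=MIN_QUESTIONS):
    """rows: list of (depth, name, count) in pre-order, as in Table 2.
    Returns the names that survive both pruning rules."""
    # recover each row's parent from the depth column
    parent, stack = {}, []
    for depth, name, _ in rows:
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent[name] = stack[-1][1] if stack else None
        stack.append((depth, name))

    # rule 1 (asterisk): drop types with too few questions
    kept = [name for _, name, count in rows if count >= min_questions]

    # rule 2 (exclamation mark): drop a type that is its parent's only child
    siblings = {}
    for name in kept:
        siblings.setdefault(parent[name], []).append(name)
    return [name for name in kept
            if parent[name] is None or len(siblings[parent[name]]) > 1]


# A small excerpt of Table 2 (markers removed from the names).
rows = [
    (3, "GROUP", 96),
    (4, "SPORTS TEAM", 20),
    (3, "ART", 78),
    (4, "PICTURE", 2),
    (4, "BROADCAST PROGRAM", 6),
    (4, "MOVIE", 15),
    (4, "SHOW", 4),
    (4, "MUSIC", 13),
]
survivors = prune_taxonomy(rows)
```

On this excerpt the sketch keeps GROUP, ART, MOVIE and MUSIC, which agrees with the markers shown for these rows in Table 2.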
4.2 Comparison Methods
We compared the HDAG Kernel (HDAG) to a baseline
method, bag-of-words (BOW) features with a polynomial
kernel, which is sometimes referred to as the
bag-of-words kernel (d1: first-degree polynomial kernel,
d2: second-degree polynomial kernel).
HDAG and BOW differ in how they consider the
structures of a given question. BOW only con-
siders attributes independently (d1) or combinato-
rially (d2) in a given question. On the other hand,
HDAG can consider the structures (relations) of the
attributes in a given question.
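The difference can be made concrete with a small sketch of the BOW baseline. The exact form of the polynomial kernel is not given here, so the form $(x \cdot y + 1)^d$ used below is an assumption, as is the function name `bow_kernel`.

```python
from collections import Counter


def bow_kernel(tokens_x, tokens_y, degree=1):
    """Polynomial kernel (x . y + 1) ** degree on bag-of-words counts.
    The '+ 1' offset is an assumption, not taken from the paper."""
    bag_x, bag_y = Counter(tokens_x), Counter(tokens_y)
    # Counter returns 0 for words that are absent from bag_y
    dot = sum(count * bag_y[word] for word, count in bag_x.items())
    return (dot + 1) ** degree


a = "from Denver to Aspen".split()
b = "from Aspen to Denver".split()

d1 = bow_kernel(a, b, degree=1)
d2 = bow_kernel(a, b, degree=2)
same_as_self = bow_kernel(a, b, degree=2) == bow_kernel(a, a, degree=2)
```

Because the two token lists contain the same words, the BOW kernel cannot tell them apart in either degree, although the direction of travel is reversed. A kernel that looks at the order and relations of the attributes can distinguish such cases.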
We selected SVM for the learning and classification
algorithm. Additionally, we evaluated the performance
using SNoW [3] to compare our method indirectly to
the SNoW-based question classifier (Li
and Roth, 2002). Note that BOW was used as features
for SNoW.
Finally, we compared the performances of
HDAG-SVM, BOW(d2)-SVM, BOW(d1)-SVM,
and BOW-SNoW. The parameters of each comparison
method were set as follows: the decay
factor $\lambda$ was 0.5 for HDAG, and the soft-margin
parameter $C$ of all SVMs was set to 1. For SNoW, we used
a106
a38a107a33a77a44a101a108a77a109a19a43a57a110a30a38a111a28a19a44a75a112 , and a113a114a38a111a108 . These parameters
were selected based on preliminary experiments.
4.3 Decision Model
Since the SVM is a two-class classification method,
we have to make a decision model to determine the
question type of a given question that is adapted for
question classification, which is a multi-class hierar-
chical classification problem.
Figure 2 shows how we constructed the final de-
cision model for question classification.
First, we made one SVM classifier for each of the 68
question types, and then we constructed “one-vs-rest
models” for each node in the hierarchical question
taxonomy. Each one-vs-rest model was constructed
from the SVM classifiers of the child question types
of the focused node. For example, the one-vs-rest
model at the node “TOP” was constructed from five
SVM classifiers: “NAME”, “NATURAL OBJECT”, “COLOR”,
“TIME TOP” and “NUMEX”. The total number of one-vs-rest
models was 17.
Finally, the decision model was constructed by
setting one-vs-rest models in the hierarchical question
taxonomy to determine the most plausible question
type of a given question.
[3] http://l2r.cs.uiuc.edu/~cogcomp/cc-software.html

Figure 2: Hierarchical classifier constructed by
SVM classifiers
[Figure omitted. It shows the decision model as a tree of question type labels (TOP; NAME, NATURAL_OBJECT, NUMEX; PERSON), with one SVM classifier per question type (QT) and, at each node, a one-vs-rest model constructed by the SVM classifiers of the child QTs.]
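The top-down use of the one-vs-rest models can be sketched as below. This is a minimal illustration under assumptions: the names `classify`, `children` and `score` are hypothetical, `score` stands in for the output of the SVM classifier of a question type, and the sketch always descends until it reaches a type without children, which the description above does not specify.

```python
def classify(question, children, score, root="TOP"):
    """Walk the taxonomy top-down and return the path of chosen types.

    children -- dict: question type -> list of child question types
    score    -- callable (question, question_type) -> float, standing in
                for the output of that type's SVM classifier
    """
    path, node = [], root
    while children.get(node):
        # one-vs-rest model at this node: the best-scoring child wins
        node = max(children[node], key=lambda qt: score(question, qt))
        path.append(node)
    return path


# Toy taxonomy and toy classifier outputs (hypothetical values).
children = {"TOP": ["NAME", "NUMEX"], "NAME": ["PERSON", "ORGANIZATION"]}
toy_scores = {"NAME": 0.8, "NUMEX": -0.3, "PERSON": -0.1, "ORGANIZATION": 0.6}

path = classify("George Bush purchased a small interest in which baseball team?",
                children, lambda q, qt: toy_scores[qt])
```

With these toy scores the walk selects NAME at the top node and then ORGANIZATION below it.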
4.4 Features
We set four feature sets for each comparison
method.
1. words only (W)
2. words and named entities (W+N)
3. words and semantic information (W+S)
4. words, named entities and semantic informa-
tion (W+N+S)
The words were analyzed in basic form, and the
semantic information was obtained from the “Goi-
taikei” (Ikehara et al., 1997), which is similar to
WordNet in English. Words, chunks and their rela-
tions in the texts were analyzed by CaboCha (Kudo
and Matsumoto, 2002), and named entities were an-
alyzed by the SVM-based NE tagger (Isozaki and
Kazawa, 2002).
Note that even when the same feature sets are used,
the way the feature spaces are constructed is
entirely different between HDAG and BOW.
4.5 Evaluation Method
We evaluated the 5011 questions by using five-
fold cross-validation and used the following two ap-
proaches to evaluate the performance.
Table 3: Results of question classification experiment
by five-fold cross-validation

Macc
                W      W+N    W+S    W+N+S
HDAG-SVM        0.862  0.871  0.877  0.882
BOW(d2)-SVM     0.841  0.847  0.847  0.856
BOW(d1)-SVM     0.839  0.843  0.837  0.851
BOW-SNoW        0.760  0.774  0.800  0.808

Qacc
                W      W+N    W+S    W+N+S
HDAG-SVM        0.730  0.736  0.742  0.749
BOW(d2)-SVM     0.678  0.691  0.686  0.704
BOW(d1)-SVM     0.679  0.686  0.671  0.694
BOW-SNoW        0.562  0.573  0.614  0.626
1. Average accuracy of each one-vs-rest model
(Macc)
This measure evaluates the performance of
each one-vs-rest model independently. If a one-
vs-rest model classifies a given question cor-
rectly, it scores a 1, otherwise, it scores a 0.
2. Average accuracy of each given question
(Qacc)
This measure evaluates the total performance
of the decision model, the question classifier.
If each given question is classified in a correct
question type, it scores a 1, otherwise, it scores
a 0.
In Qacc, classifying with a correct question type im-
plies that all of the one-vs-rest models from the top
of the hierarchy of the question taxonomy to the
given question type must classify correctly.
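The two measures can be sketched as follows. This is one possible reading of the definitions above and not a confirmed implementation: the names `qacc` and `macc` are hypothetical, a question type is represented by its path from the top of the taxonomy, and Macc is taken to be the mean of the per-model accuracies.

```python
def qacc(gold_paths, predicted_paths, depth=None):
    """Fraction of questions whose predicted path equals the gold path,
    optionally comparing only the upper `depth` levels."""
    correct = sum(g[:depth] == p[:depth]
                  for g, p in zip(gold_paths, predicted_paths))
    return correct / len(gold_paths)


def macc(decisions):
    """decisions: list of (model_name, was_correct) pairs, one for each
    one-vs-rest model decision that was evaluated in isolation."""
    per_model = {}
    for model, ok in decisions:
        hits, total = per_model.get(model, (0, 0))
        per_model[model] = (hits + ok, total + 1)
    # accuracy of each model first, then the mean over models
    return sum(h / t for h, t in per_model.values()) / len(per_model)
```

A question counts as correct for `qacc` only when every level of its path is correct, which corresponds to the requirement that all one-vs-rest models from the top of the hierarchy down to the given question type must classify correctly.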
4.6 Results
Table 3 shows the results of our question classifi-
cation experiment, which is evaluated by five-fold
cross-validation.
5 Discussion
First, we could increase the performance by using
the information on named entities and semantic
information compared to only using the words, which
is consistent with the result given in (Li and Roth, 2002).
This result indicates that high-performance question
classification requires not only word features but also
many more types of information in the question.
Table 4: Accuracy of each question (Qacc) evaluated
at different depths of hierarchy in question taxonomy

d  # of QTs   W      W+N    W+S    W+N+S
HDAG-SVM
1  5          0.946  0.944  0.953  0.948
2  25         0.795  0.794  0.800  0.803
3  55         0.741  0.743  0.751  0.756
4  68         0.730  0.736  0.742  0.749
BOW(d2)-SVM
1  5          0.906  0.914  0.908  0.925
2  25         0.736  0.748  0.748  0.763
3  55         0.687  0.698  0.695  0.712
4  68         0.678  0.691  0.686  0.704
BOW(d1)-SVM
1  5          0.906  0.918  0.905  0.917
2  25         0.736  0.752  0.730  0.752
3  55         0.688  0.697  0.678  0.701
4  68         0.679  0.686  0.671  0.694
BOW-SNoW
1  5          0.862  0.870  0.880  0.896
2  25         0.635  0.640  0.687  0.696
3  55         0.570  0.582  0.623  0.634
4  68         0.562  0.573  0.614  0.626
Second, our proposed method showed higher performance
than any method using BOW. This result
indicates that the structural information in the
question, which includes several levels of chunks
and their relations, provides powerful features
for classifying the target intention of a given question.
We assume that such structural information
provides shallow semantic information of the text.
Therefore, it is natural that using the structural
information in the manner of our proposed method
improves the performance of identifying the intention
of the question.
Table 4 shows the results of Qacc at each depth
of the question taxonomy. The results of depth $d$
represent the total performance measured by Qacc,
considering only the upper $d$ levels of question types
in the question taxonomy. As the depth increases,
all results show worse performance. There are several
reasons for this. One problem is the unbalanced
training data, where the question types deeper in the
hierarchy have fewer positively labeled samples
(questions), as shown in Table 2. Moreover,
misclassification is multiplied during the classification
process: if an upper-level classifier misclassifies
a question, we can no longer obtain a correct answer,
even if a lower-level classifier has the
ability to classify it correctly. Thus, a machine
learning approach (not only SVM) is not well suited
to deep hierarchically structured class labels. We
should arrange a question taxonomy that is suitable
for machine learning in order to improve the total
performance of question classification.
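The multiplication of errors can be read off Table 4 directly. The sketch below takes the HDAG-SVM row with W+N+S features and computes, for each depth, the share of questions that remain correct among those that were correct one level higher. It assumes that a question that is correct at depth $d$ is also correct at depth $d-1$; the variable names are hypothetical.

```python
# Qacc of HDAG-SVM with W+N+S features at depths 1..4, copied from Table 4
qacc_by_depth = [0.948, 0.803, 0.756, 0.749]

# share of questions still correct at depth d among those correct at d - 1
retention = [deeper / shallower
             for shallower, deeper in zip(qacc_by_depth, qacc_by_depth[1:])]

# multiplying the depth-1 accuracy by every retention factor recovers depth 4
product = qacc_by_depth[0]
for r in retention:
    product *= r
```

The retention factors are roughly 0.847, 0.941 and 0.991, so under this reading most of the loss occurs between the first and second levels of the taxonomy.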
The performance of SVM is better than
that of SNoW, even when handling the same BOW
features. One advantage of SNoW is that its learning
and classification are much faster than those of
SVM. We should thus select the best approach for
the purpose, depending on whether speed or accuracy
is needed.
6 Conclusions
This paper presents a machine learning approach to
question classification. We proposed the HDAG ker-
nel, a new kernel function, that can easily handle
structured natural language data.
Our experimental results showed that features of
the structure in a given question, which can be
computed by the HDAG kernel, are useful for improving
the performance of question classification. This is
because structures inside a text provide the semantic
features of a question that are required for
high-performance question classification.

References
M. Collins and N. Duffy. 2001. Convolution Kernels
for Natural Language. In Proc. of Neural Information
Processing Systems (NIPS’2001).
C. Cortes and V. N. Vapnik. 1995. Support Vector Net-
works. Machine Learning, 20:273–297.
C. Fellbaum. 1998. WordNet: An Electronic Lexical
Database. MIT Press.
S. Harabagiu, M. Pasca, and S. Maiorano. 2000. FAL-
CON: Boosting Knowledge for Answer Engines. In
Proc. of the 9th Text Retrieval Conference (TREC-9).
NIST.
D. Haussler. 1999. Convolution Kernels on Discrete
Structures. Technical Report UCSC-CRL-99-10. UC
Santa Cruz.
U. Hermjakob. 2001. Parsing and Question Classifica-
tions for Question Answering. In Proc. of the Work-
shop on Open-Domain Question Answering at ACL-
2001. ACL.
E. H. Hovy, L. Gerber, U. Hermjakob, C.-Y. Lin, and
D. Ravichandran. 2001. Toward Semantics-Based
Answer Pinpointing. In Proc. of the Human Language
Technology Conference (HLT2001).
S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo,
H. Nakaiwa, K. Ogura, Y. Oyama, and Y. Hayashi,
editors. 1997. The Semantic Attribute System, Goi-
Taikei — A Japanese Lexicon, volume 1. Iwanami
Publishing. (in Japanese).
H. Isozaki and H. Kazawa. 2002. Efficient Support
Vector Classifiers for Named Entity Recognition. In
Proc. of the 19th International Conference on Compu-
tational Linguistics (COLING 2002), pages 390–396.
A. Ittycheriah, M. Franz, and S. Roukos. 2001. IBM’s
Statistical Question Answering System – TREC-10.
In Proc. of TREC 2001. NIST.
T. Joachims. 1998. Text Categorization with Support
Vector Machines: Learning with Many Relevant Features.
In Proc. of European Conference on Machine
Learning (ECML ’98), pages 137–142.
T. Kudo and Y. Matsumoto. 2002. Japanese Depen-
dency Analysis using Cascaded Chunking. In Proc.
of the 6th Conference on Natural Language Learning
(CoNLL 2002), pages 63–69.
X. Li and D. Roth. 2002. Learning Question Classi-
fiers. In Proc. of the 19th International Conference
on Computational Linguistics (COLING 2002), pages
556–562.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini,
and C. Watkins. 2002. Text Classification Using
String Kernels. Journal of Machine Learning Research,
2:419–444.
J. Suzuki, Y. Sasaki, and E. Maeda. 2002a. Question
type classification using statistical machine learning.
In Forum on Information Technology (FIT2002), Infor-
mation Technology Letters (in Japanese), pages 89–90.
J. Suzuki, Y. Sasaki, and E. Maeda. 2002b. SVM Answer
Selection for Open-Domain Question Answering. In
Proc. of the 19th International Conference on Compu-
tational Linguistics (COLING 2002), pages 974–980.
J. Suzuki, T. Hirao, Y. Sasaki, and E. Maeda. 2003.
Hierarchical directed acyclic graph kernel: Methods for
natural language data. In Proc. of the 41st Annual
Meeting of the Association for Computational Linguistics
(ACL-2003), to appear.
H. Taira and M. Haruno. 1999. Feature Selection in
SVM Text Categorization. In Proc. of the 16th Con-
ference of the American Association for Artificial In-
telligence (AAAI ’99), pages 480–486.
