Hierarchical Directed Acyclic Graph Kernel:
Methods for Structured Natural Language Data
Jun Suzuki, Tsutomu Hirao, Yutaka Sasaki, and Eisaku Maeda
NTT Communication Science Laboratories, NTT Corp.
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237 Japan
{jun, hirao, sasaki, maeda}@cslab.kecl.ntt.co.jp
Abstract
This paper proposes the “Hierarchical Di-
rected Acyclic Graph (HDAG) Kernel” for
structured natural language data. The
HDAG Kernel directly accepts several lev-
els of both chunks and their relations,
and then efficiently computes the weighted
sum of the number of common attribute
sequences of the HDAGs. We applied the
proposed method to question classifica-
tion and sentence alignment tasks to eval-
uate its performance as a similarity mea-
sure and a kernel function. The results
of the experiments demonstrate that the
HDAG Kernel is superior to other kernel
functions and baseline methods.
1 Introduction
As it has become easy to get structured corpora such
as annotated texts, many researchers have applied
statistical and machine learning techniques to NLP
tasks; thus, the accuracies of basic NLP tools, such
as POS taggers, NP chunkers, named entity taggers
and dependency analyzers, have been improved
to the point that they can realize practical applica-
tions in NLP.
The motivation of this paper is to identify and
use richer information within texts that will improve
the performance of NLP applications; this is in con-
trast to using feature vectors constructed by the
bag-of-words approach (Salton et al., 1975).
We focus here on methods that use numerical
feature vectors to represent the features of
natural language data. In this case, since the
original natural language data is symbolic,
researchers must convert the symbolic data into
numeric data. This process, feature extraction,
is ad hoc in nature and differs with each NLP
task; there has been no neat formulation for
generating feature vectors from the semantic and
grammatical structures inside texts.
Kernel methods (Vapnik, 1995; Cristianini and
Shawe-Taylor, 2000) suitable for NLP have recently
been devised. Convolution Kernels (Haussler, 1999)
demonstrate how to build kernels over discrete struc-
tures such as strings, trees, and graphs. One of the
most remarkable properties of this kernel method-
ology is that it retains the original representation
of objects, and algorithms manipulate the objects
simply by computing kernel functions, i.e., the in-
ner products between pairs of objects. This means
that we do not have to map texts to the feature
vectors by explicitly representing them, as long as
an efficient calculation for the inner products be-
tween a pair of texts is defined. The kernel method
is widely adopted in Machine Learning methods,
such as the Support Vector Machine (SVM) (Vap-
nik, 1995). In addition, the kernel function K(x, y)
has been described as a similarity function that
satisfies certain properties (Cristianini and Shawe-
Taylor, 2000). The similarity measure between texts
is one of the most important factors for some tasks in
the application areas of NLP such as Machine Trans-
lation, Text Categorization, Information Retrieval,
and Question Answering.
This paper proposes the Hierarchical Directed
Acyclic Graph (HDAG) Kernel. It can handle sev-
eral of the structures found within texts and can cal-
culate the similarity with regard to these structures
at practical cost and time. The HDAG Kernel can be
widely applied to learning, clustering and similarity
measures in NLP tasks.
The following sections define the HDAG Kernel
and introduce an algorithm that implements it. The
results of applying the HDAG Kernel to the tasks
of question classification and sentence alignment are
then discussed.
2 Convolution Kernels
Convolution Kernels were proposed as a concept of
kernels for discrete structures. This framework de-
fines a kernel function between input objects by ap-
plying convolution “sub-kernels” that are the kernels
for the decompositions (parts) of the objects.
Let D be a positive integer and X, X_1, ..., X_D
be nonempty, separable metric spaces. This paper
focuses on the special case where X, X_1, ..., X_D
are countable sets. We start with x ∈ X as a
composite structure and x_1, ..., x_D as its
"parts", where x_d ∈ X_d. R is defined as a
relation on the set X_1 × ... × X_D × X such that
R(x_1, ..., x_D, x) is true if x_1, ..., x_D are
the "parts" of x. R⁻¹(x) is then defined as the
set of all part tuples of x, that is, R⁻¹(x) =
{(x_1, ..., x_D) | R(x_1, ..., x_D, x)}.
Suppose x, y ∈ X, with x_1, ..., x_D the parts
of x and y_1, ..., y_D the parts of y. Then, the
similarity K(x, y) between x and y is defined as
the following generalized convolution:

K(x, y) = Σ_{(x_1,...,x_D)∈R⁻¹(x)} Σ_{(y_1,...,y_D)∈R⁻¹(y)} Π_{d=1..D} K_d(x_d, y_d)   (1)
We note that Convolution Kernels are abstract con-
cepts, and that instances of them are determined by
the definition of the sub-kernels K_d(x_d, y_d). The
Tree Kernel (Collins and Duffy, 2001) and String
Subsequence Kernel (SSK) (Lodhi et al., 2002),
developed in the NLP field, are examples of
Convolution Kernel instances.
An explicit definition of both the Tree Kernel and
SSK, K(x, y), is written as:
K(x, y) = ⟨φ(x), φ(y)⟩ = Σ_{i=1..m} φ_i(x) · φ_i(y)   (2)

Conceptually, we enumerate all sub-structures oc-
curring in x and y, where m represents the total
number of possible sub-structures in the objects.
φ, the feature mapping from the sample space to
the feature space, is given by φ(x) =
(φ_1(x), ..., φ_m(x)).
In the case of the Tree Kernel, x and y are trees.
The Tree Kernel computes the number of common
subtrees in the two trees x and y. φ_i(x) is defined
as the number of occurrences of the i'th enumerated
subtree in tree x.
In the case of SSK, the input objects x and y are
string sequences, and the kernel function computes
the sum of the occurrences of the i'th common
subsequence, φ_i(x), weighted according to the
length of the subsequence. Both kernels can be
computed in polynomial time through efficient
recursive formulations of equation (1). Our
proposed method uses the framework of Convolution
Kernels.
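To make equation (2) concrete, the following toy
Python fragment (our own illustration, not taken
from the cited work) builds the explicit feature map
φ for a miniature subsequence kernel by enumerating
every gappy two-character subsequence with an
SSK-style length penalty, then takes the inner
product of the two count vectors; real kernels avoid
this explicit enumeration through recursion:

    from collections import Counter
    from itertools import combinations

    def phi(s, lam=0.5):
        # explicit feature map: every (possibly gappy) two-character
        # subsequence of s, weighted by lam ** span as in SSK
        feats = Counter()
        for i, j in combinations(range(len(s)), 2):
            feats[s[i] + s[j]] += lam ** (j - i + 1)
        return feats

    def kernel(s, t, lam=0.5):
        # K(s, t) = <phi(s), phi(t)>, the inner product of equation (2)
        ps, pt = phi(s, lam), phi(t, lam)
        return sum(v * pt[k] for k, v in ps.items())

    print(kernel("cat", "cart"))  # common subsequences: 'ca', 'ct', 'at'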
3 HDAG Kernel
3.1 Definition of HDAG
This paper defines HDAG as a Directed Acyclic
Graph (DAG) with hierarchical structures. That is,
certain nodes contain DAGs within themselves.
In basic NLP tasks, chunking and parsing are used
to analyze the text semantically or grammatically.
There are several levels of chunks, such as phrases,
named entities and sentences, and these are bound
by relation structures, such as dependency structure,
anaphora, and coreference. HDAG is designed to
enable the representation of all of these structures
inside texts, hierarchical structures for chunks and
DAG structures for the relations of chunks. We be-
lieve this richer representation is extremely useful
for improving the performance of similarity measures
between texts, as well as of learning and clustering
tasks in the application areas of NLP.
Figure 1 shows an example of the text structures
that can be handled by HDAG. Figure 2 contains
simple examples of HDAG that elucidate the calcu-
lation of similarity.
As shown in Figures 1 and 2, the nodes are allowed
to have one or more attributes, because nodes in
texts usually have several kinds of attributes.
For example, attributes include words,
part-of-speech tags, semantic information such as
WordNet, and the class of the named entity.

[Figure 1 omitted: two sentences, "Junichi Tsujii
is the General Chair of ACL2003." and "He is one
of the most famous researchers in the NLP field.",
whose nodes carry words, part-of-speech tags, NP
chunks, and named-entity classes as attributes and
are linked by dependency and coreference relations.]
Figure 1: Example of the text structures handled by
HDAG

[Figure 2 omitted: two small HDAGs, G_1 with nodes
p_1-p_7 and G_2 with nodes q_1-q_8; nodes carry
attributes such as a-e, N, V and NP, and
non-terminated nodes contain DAGs of other nodes.]
Figure 2: Examples of HDAG structure
3.2 Definition of HDAG Kernel
First of all, we define the sets of nodes in HDAGs
G_1 and G_2 as P and Q, respectively; p and q
represent nodes in the graphs, defined as
{p | p_i ∈ P, i = 1, ..., |P|} and
{q | q_j ∈ Q, j = 1, ..., |Q|}, respectively. We
use the expression p_1 → p_4 → p_7 to represent
the path from p_1 to p_7 through p_4.
We define an "attribute sequence" as a sequence of
attributes extracted from the nodes included in a
sub-path. An attribute sequence is expressed as
'A-B' or 'A-(C-B)', where ( ) represents a chunk.
As a basic example of the extraction of attribute
sequences from a sub-path, q_2 → q_3 in Figure 2
contains the four attribute sequences 'e-b', 'e-V',
'N-b' and 'N-V', which are the combinations of all
attributes in q_2 and q_3. Section 3.3 explains in
detail the method of extracting attribute sequences
from sub-paths.
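For instance, the four attribute sequences above
are simply the cross product of the attribute sets
of the two nodes; a minimal sketch (the node
contents follow Figure 2, the representation is our
own):

    from itertools import product

    q2 = ["e", "N"]   # attributes of node q2
    q3 = ["b", "V"]   # attributes of node q3

    # attribute sequences of the sub-path q2 -> q3,
    # one attribute taken per node
    seqs = ["-".join(pair) for pair in product(q2, q3)]
    print(seqs)  # ['e-b', 'e-V', 'N-b', 'N-V']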
Next, we define "terminated nodes" as those that do
not contain any graph, such as p_2 and p_5, and
"non-terminated nodes" as those that do, such as
q_1 and q_4.
Since HDAGs treat not only exact matching of
sub-structures but also approximate matching, we
allow node skips governed by decay factor λ
(0 < λ ≤ 1) when extracting attribute sequences
from the sub-paths. This framework makes the
similarity evaluation robust: similar
sub-structures can contribute to the value of the
similarity, in contrast to exact matching, which
never evaluates similar sub-structures. Next, we
define parameter n (n = 1, 2, ...) as the number of
attributes combined in an attribute sequence. When
calculating similarity, we consider only
combination lengths of up to n.
Given the above discussion, the feature vector of
an HDAG is written as φ(G) = (φ_1(G), ..., φ_m(G)),
where φ represents the explicit feature mapping of
the HDAG and m represents the number of all
possible n-attribute combinations. The value of
φ_i(G) is the number of occurrences of the i'th
attribute sequence in HDAG G; each attribute
sequence is weighted according to its node skips.
The similarity between HDAGs, which is the
definition of the HDAG Kernel, follows equation
(2), where the input objects x and y are G_1 and
G_2, respectively. According to this approach, the
HDAG Kernel calculates the inner product of the
common attribute sequences, weighted according to
their node skips and occurrences, between the two
HDAGs G_1 and G_2.
We note that, in general, if the dimension of the
feature space becomes very high or approaches
infinity, it becomes computationally infeasible to
generate the feature vector φ(G) explicitly. To
improve the reader's understanding of what the HDAG
Kernel calculates, before we introduce our
efficient calculation method, the next section
details the attribute sequences that would become
the elements of the feature vector if the
calculation were explicit.
3.3 Attribute Sequences: The Elements of the
Feature Vector
We describe the details of the attribute sequences
that are elements of the feature vector of the
HDAG Kernel using G_1 and G_2 in Figure 2.
The framework of node skip
We denote the explicit representation of a node
skip by 'ε'. The attribute sequences in a sub-path
containing a node skip are written as, e.g.,
'a-ε-c'. It costs λ to skip a terminated node; the
cost of skipping a non-terminated node is the same
as that of skipping all the graphs inside it.
Table 1: Attribute sequences and the values of
nodes p_1 and q_4

p_1:
  n=1   sub-path   a. seq.    val.
        p_1        NP         1
        p_2        a-ε        λ
        p_2        N-ε        λ
        p_3        c-ε        λ
        p_4        ε-b        2λ
  n=2   p_2→p_4    a-b        1
        p_2→p_4    N-b        1
        p_3→p_4    c-b        1

q_4:
  n=1   sub-path   a. seq.    val.
        q_4        NP         1
        q_5        (ε-ε)-a    λ²
        q_6        (c-ε)-ε    λ²
        q_6        (ε-d)-ε    λ²
  n=2   q_6        (c-d)-ε    λ
        q_6→q_5    (c-ε)-a    λ
        q_6→q_5    (ε-d)-a    λ
  n=3   q_6→q_5    (c-d)-a    1
We introduce decay functions Λ_λ(p), μ_λ(p) and
ν_λ(p), all based on decay factor λ. Λ_λ(p)
represents the cost of the node skip of p. For
example, Λ_λ(p_1) = 2λ² represents the cost of the
node skips p_2 → p_4 and p_3 → p_4 inside p_1,
while Λ_λ(p_2) = λ is the cost of just the node
skip of p_2. μ_λ(p) represents the sum of the
multiplied costs of the node skips of all of the
nodes that have a path to p: μ_λ(p_4) = 2λ is the
summed cost of both p_2 and p_3, which have a path
to p_4, and μ_λ(p_1) = 1 (λ⁰). ν_λ(p) represents
the sum of the multiplied costs of the node skips
of all the nodes that p has a path to: ν_λ(p_2) = λ
represents the cost of the node skip of p_4, to
which p_2 has a path.
Attribute sequences for non-terminated nodes
We define the attributes of a non-terminated node
as the combinations of all the attribute sequences
inside it, including node skips. Table 1 shows the
attribute sequences and values of p_1 and q_4.
Details of the elements in the feature vector
Node skips are not distinguished in the elements of
the feature vector. This means that 'A-ε-B-C' is
the same element as 'A-B-C', and 'A-ε-ε-B-C' and
'A-ε-B-ε-C' are also the same element as 'A-B-C'.
Considering the hierarchical structure, it is
natural to treat '(N-ε)-(d)-a' and '(N-ε)-((ε-d)-a)'
as different elements. However, in the framework of
the node skip and the attributes of the
non-terminated node, '(N-ε)-(ε)-a' and
'(N-ε)-((ε-ε)-a)' are treated as the same element.
Table 2: Similarity values of G_1 and G_2 in Figure 2

       G_1                        G_2
       att. seq.      value      att. seq.            value
n=1    NP             1          NP                   1        1
       N              1          N                    1        1
       a              2          a                    1        2
       b              1          b                    1        1
       c              1          c                    1        1
       d              1          d                    1        1
n=2    (N-ε)-(ε)-a    λ²         (N-ε)-((ε-ε)-a)      λ³       λ⁵
       N-b            1          N-b                  1        1
       (N-ε)-(d)      λ          (N-ε)-((ε-d)-ε)      λ³       λ⁴
       (ε-b)-(ε)-a    2λ²        (ε-b)-((ε-ε)-a)      λ³       2λ⁵
       (ε-b)-(d)      2λ         (ε-b)-((ε-d)-ε)      λ³       2λ⁴
       (c-ε)-(ε)-a    λ²         ((c-ε)-a)            λ        λ³
       (c-ε)-(d)      λ          c-d                  1        λ
       (d)-a          1          (c-ε)-a              λ        λ
n=3    (N-b)-(ε)-a    λ          (N-b)-((ε-ε)-a)      λ²       λ³
       (N-b)-(d)      1          (N-b)-((ε-d)-ε)      λ²       λ²
This framework achieves approximate matching of the
structure automatically. The HDAG Kernel judges,
for every pair of attributes in each attribute
sequence, whether the two attributes are inside or
outside the same chunk. If all pairs of attributes
in two attribute sequences are in the same
condition, inside or outside the chunk, then the
attribute sequences are judged to be the same
element.
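One possible realization of this judgement in code
(our own sketch; the pair representation and chunk
identifiers are hypothetical, and only the
innermost chunk of each attribute is tracked)
compares two sequences by their attributes and by
whether each attribute pair shares a chunk:

    def signature(seq):
        # seq: list of (attribute, chunk_id) pairs, skip marks
        # already dropped; chunk_id = None means top level
        attrs = tuple(a for a, _ in seq)
        same_chunk = tuple(
            seq[i][1] is not None and seq[i][1] == seq[j][1]
            for i in range(len(seq))
            for j in range(i + 1, len(seq))
        )
        return attrs, same_chunk

    s1 = [("N", "c1"), ("a", None)]                 # '(N-ε)-(ε)-a'
    s2 = [("N", "c1"), ("a", "c2")]                 # '(N-ε)-((ε-ε)-a)'
    s3 = [("N", "c1"), ("d", "c2"), ("a", None)]    # '(N-ε)-(d)-a'
    s4 = [("N", "c1"), ("d", "c2"), ("a", "c2")]    # '(N-ε)-((ε-d)-a)'
    print(signature(s1) == signature(s2))  # True: the same element
    print(signature(s3) == signature(s4))  # False: d, a share a chunk only in s4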
Table 2 shows the similarity, that is, the value of
K_HDAG(G_1, G_2), when the feature vectors are
explicitly represented. We only show the common
elements of each feature vector that appear in both
G_1 and G_2, since the number of elements that
appear in only G_1 or G_2 becomes very large.
Note that, as shown in Table 2, the attribute
sequences of the non-terminated node itself are not
included in the features of the graph. This is due
to the use of the hierarchical structure: the
attribute sequences of a non-terminated node come
from the combinations of the attributes of the
terminated nodes inside it. In the case of p_1, the
attribute sequence 'N-ε' comes from 'N' in p_2. If
we treated both 'N-ε' in p_1 and 'N' in p_2, we
would evaluate the attribute sequence 'N' in p_2
twice. That is why the similarity value in Table 2
contains neither 'c-ε' in p_1 nor '(c-ε)-ε' in q_4;
see Table 1.
3.4 Calculation
First, we determine C_n(p, q), which returns the
sum of the common attribute sequences of the
n-combination of attributes between nodes p and q:

C_n(p, q) = C'_n(p, q) + num(p, q),  if n = 1;
C_n(p, q) = C'_n(p, q),  otherwise.   (3)

C'_n(p, q) =
  0,  if in(p) = ∅ and in(q) = ∅;
  Σ_{x∈in(p)} μ_λ(x) · ν_λ(x) · num(x, q),
      if in(p) ≠ ∅ and in(q) = ∅;
  Σ_{y∈in(q)} μ_λ(y) · ν_λ(y) · num(p, y),
      if in(p) = ∅ and in(q) ≠ ∅;
  Σ_{x∈in(p)} Σ_{y∈in(q)} ν_λ(x) · ν_λ(y) · D_n(x, y),
      otherwise.   (4)
num(p, q) returns the number of common attributes
of nodes p and q, not including the attributes of
nodes inside p and q. We define the function in(p)
as returning the set of nodes inside non-terminated
node p; in(p) = ∅ means that node p is a terminated
node. For example, in(p_1) = {p_2, p_3, p_4} and
in(p_2) = ∅. We define functions D_n(p, q),
D'_n(p, q) and D''_n(p, q) to calculate C_n(p, q):
D_n(p, q) = C_n(p, q) + Σ_{m=1..n−1} D'_m(p, q) · C_{n−m}(p, q)   (5)

D'_n(p, q) = Σ_{y∈lnk(q)} [ Λ_λ(y) · D'_n(p, y) + D''_n(p, y) ]   (6)

D''_n(p, q) = Σ_{x∈lnk(p)} [ Λ_λ(x) · D''_n(x, q) + D_n(x, q) ]   (7)
The boundary conditions are

D_n(p, q) = μ_λ(p) · μ_λ(q) · C_n(p, q),  if n = 1   (8)

D'_n(p, q) = 0,  if lnk(q) = ∅   (9)

D''_n(p, q) = 0,  if lnk(p) = ∅.   (10)
The function lnk(p) returns the set of nodes that
have direct links to node p; lnk(p) = ∅ means that
no nodes have direct links to p. For example,
lnk(p_4) = {p_2, p_3} and lnk(p_1) = ∅.
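For concreteness, a hypothetical dictionary
encoding of G_1 from Figure 2 (restricted to p_1
through p_4, whose attributes appear in Table 1;
our own illustration) realizes these accessors as
follows:

    attrs1 = {"p1": {"NP"}, "p2": {"a", "N"}, "p3": {"c"}, "p4": {"b"}}
    inside1 = {"p1": {"p2", "p3", "p4"}}   # p1 is non-terminated
    links1 = {"p4": {"p2", "p3"}}          # direct links into p4

    def in_(p):   # in(p): nodes inside p; empty for terminated nodes
        return inside1.get(p, set())

    def lnk(p):   # lnk(p): nodes with direct links to p
        return links1.get(p, set())

    def num(p, q, attrs_a, attrs_b):   # common attributes of p and q
        return len(attrs_a[p] & attrs_b[q])

    print(sorted(in_("p1")), sorted(lnk("p4")), lnk("p1"))
    # ['p2', 'p3', 'p4'] ['p2', 'p3'] set()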
Next, we define K_n(p, q) as representing the sum
of the common attribute sequences that are the
n-combinations of attributes extracted from the
sub-paths whose sinks are p and q, respectively:

K_n(p, q) = num(p, q),  if n = 1;
K_n(p, q) = Σ_{m=1..n−1} L'_m(p, q) · C_{n−m}(p, q),  otherwise.   (11)
The functions L_n(p, q), L'_n(p, q) and L''_n(p, q),
needed for the recursive calculation of K_n(p, q),
are written in the same form as D_n(p, q),
D'_n(p, q) and D''_n(p, q) respectively, except for
the boundary condition of L_n(p, q), which is
written as:

L_n(p, q) = C_n(p, q),  if n = 1.   (12)
Finally, an efficient similarity calculation
formula is written as

K_HDAG(G_1, G_2) = Σ_{m=1..n} Σ_{p∈P} Σ_{q∈Q} K_m(p, q).   (13)

According to equation (13), given the recursive
definition of K_n(p, q), the similarity between two
HDAGs can be calculated in O(n|P||Q|) time.¹
3.5 Efficient Calculation Method
We will now elucidate an efficient processing
algorithm. First, as a pre-process, the nodes are
sorted under the following condition: all nodes
that have a path to the focused node, and all nodes
in the graph inside the focused node, should be
placed before the focused node. We can obtain at
least one such ordering since we are treating an
HDAG. In the case of G_1, we can get (p_2, p_3,
p_4, p_1, p_5, p_6, p_7). We can rewrite the
recursive calculation formulas as "for loops" if we
follow the sorted order. Figure 3 shows the
algorithm of the HDAG Kernel. A dynamic programming
technique allows the HDAG Kernel to be computed
very efficiently: when following the sorted order,
the values needed to process the focused pair of
nodes have already been calculated in a previous
step. We can calculate the table by following the
order of the nodes from left to right and top to
bottom.
We normalize the computed kernels before their use
within the algorithms. The normalization
corresponds to the standard unit-norm normalization
of examples in the feature space corresponding to
the kernel space (Lodhi et al., 2002), given in
equation (14) below.
¹We can easily rewrite the equations to calculate
all combinations of attributes, but the order of
the calculation time then becomes larger than
O(n|P||Q|).
[Figure 3 omitted: pseudocode of the
dynamic-programming procedure, which loops over all
node pairs (p_i, q_j) in the sorted order,
accumulates the values of equations (3)-(12) in a
table, and returns the sum of equation (13).]
Figure 3: Algorithm of the HDAG Kernel
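As a much simplified sketch of this
dynamic-programming idea, the following Python code
(our own illustration, not the paper's
implementation) computes the flat-DAG special case
of the kernel: no hierarchy, one shared attribute
per matched node pair (the n = 1 combination case
of num), and node skips decayed by λ. The graph
encoding is hypothetical, and the mutual recursion
of h, a and b mirrors the roles of D_n, D''_n and
D'_n in equations (5)-(7), with memoization
standing in for the sorted-order table of Figure 3:

    from functools import lru_cache

    def dag_kernel(attrs1, pred1, attrs2, pred2, lam=0.5):
        # attrs*: dict node -> set of attributes
        # pred*: dict node -> list of direct predecessors
        def num(p, q):
            # number of common attributes of nodes p and q
            return len(attrs1[p] & attrs2[q])

        @lru_cache(maxsize=None)
        def h(p, q):
            # weighted count of common sequences whose sinks are p and q
            return num(p, q) * (1 + b(p, q))

        @lru_cache(maxsize=None)
        def a(p, q):
            # extend the first-graph side backwards: match at a
            # predecessor x of p, or skip x at cost lam and continue
            return sum(h(x, q) + lam * a(x, q) for x in pred1[p])

        @lru_cache(maxsize=None)
        def b(p, q):
            # then extend the second-graph side backwards the same way
            return sum(a(p, y) + lam * b(p, y) for y in pred2[q])

        return sum(h(p, q) for p in attrs1 for q in attrs2)

    # Two tiny DAGs: x1 -> x3 <- x2 and y1 -> y2 -> y3.
    attrs1 = {"x1": {"N"}, "x2": {"c"}, "x3": {"b"}}
    pred1 = {"x1": [], "x2": [], "x3": ["x1", "x2"]}
    attrs2 = {"y1": {"N"}, "y2": {"c"}, "y3": {"b"}}
    pred2 = {"y1": [], "y2": [], "y3": ["y2"]}
    print(dag_kernel(attrs1, pred1, attrs2, pred2))

With λ = 0.5 this prints 4.5: the exactly matching
sequences 'N', 'c', 'b' and 'c-b', plus 'N-b',
which skips one node on the second graph and is
therefore decayed by λ.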
K̂(x, y) = K(x, y) / √( K(x, x) · K(y, y) )   (14)
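In code, equation (14) is a one-line wrapper around
any kernel function k (our own sketch):

    import math

    def normalized(k, x, y):
        # unit-norm normalization of kernel k, equation (14)
        return k(x, y) / math.sqrt(k(x, x) * k(y, y))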
4 Experiments
We evaluated the performance of the proposed
method in an actual application of NLP; the data set
is written in Japanese.
We compared HDAG and DAG (the latter having no
hierarchical structure) with the String Subsequence
Kernel (SSK) applied to word sequences, the
Dependency Structure Kernel (DSK) (Collins and
Duffy, 2001) (a special case of the Tree Kernel),
the Cosine measure for feature vectors consisting
of the occurrences of attributes (BOA), and the
same as BOA but restricted to the attributes of
nouns and unknown words (BOA').

[Figure 4 omitted: the question "George Bush
purchased a small interest in which baseball
team ?" represented as (a) a hierarchical and
dependency structure, (b) a dependency structure,
and (c) word order, with POS tags, NP and PP
chunks, and the PERSON named-entity class as node
attributes.]
Figure 4: Examples of Input Object Structure: (a)
HDAG, (b) DAG and DSK', (c) SSK'
We expanded SSK and DSK to improve their total
performance in the experiments; we denote the
expanded versions as SSK' and DSK', respectively.
The original SSK treats only exact n-length string
combinations based on parameter n; SSK' considers
string combinations of up to n. The original DSK
was specifically constructed for parse tree use. We
expanded it to be able to treat n-combinations of
nodes and free order of child node matching.
Figure 4 shows the input objects for each evaluated
kernel: (a) for HDAG, (b) for DAG and DSK', and (c)
for SSK'. Note that though DAG and DSK' treat the
same input objects, their kernel calculation
methods differ, as do the values they return.
We used the words and semantic information of
“Goi-taikei” (Ikehara et al., 1997), which is similar
to WordNet in English, as the attributes of the node.
The chunks and their relations in the texts were an-
alyzed by cabocha (Kudo and Matsumoto, 2002),
and named entities were analyzed by the method
of (Isozaki and Kazawa, 2002).
We tested each n-combination case, varying
parameter λ from 0.1 through 0.9 in steps of 0.1.
Only the best performance achieved over parameter λ
is shown for each case.
Table 3: Results of the performance as a similarity
measure for question classification

n      1     2     3     4     5     6
HDAG   -   .580  .583  .580  .579  .573
DAG    -   .577  .578  .573  .573  .563
DSK'   -   .547  .469  .441  .436  .436
SSK'   -   .568  .572  .570  .562  .548
BOA   .556
BOA'  .555
4.1 Performance as a Similarity Measure
Question Classification
We used the 1011 questions of NTCIR-QAC1² and the
2000 questions of the CRL-QA data³. We assigned
them to 148 question types based on the CRL-QA
data.
We evaluated classification performance by the
following steps. First, we extracted one question
from the data. Second, we calculated the similarity
between the extracted question and all the other
questions. Third, we ranked the questions in order
of descending similarity. Finally, we evaluated the
performance as a similarity measure by Mean
Reciprocal Rank (MRR) (Voorhees and Tice, 1999)
based on the question types of the ranked
questions.
Table 3 shows the results of this experiment.
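MRR is the mean, over all queries, of the
reciprocal rank of the first correct answer. A
minimal sketch of the computation (our own; the
label representation is hypothetical):

    def mean_reciprocal_rank(rankings, gold):
        # rankings[i]: candidate labels for query i, best first
        # gold[i]: the correct label (here, question type)
        total = 0.0
        for ranked, answer in zip(rankings, gold):
            for rank, label in enumerate(ranked, start=1):
                if label == answer:
                    total += 1.0 / rank
                    break
        return total / len(gold)

    print(mean_reciprocal_rank([["LOC", "PERSON", "DATE"]], ["PERSON"]))  # 0.5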
Sentence Alignment
The data set (Hirao et al., 2003), taken from the
"Mainichi Shinbun", consists of abstract sentences
that were manually aligned to sentences in the
"Yomiuri Shinbun" according to the meaning of the
sentences (whether they say the same thing).
This experiment was conducted as follows.
First, we extracted one abstract sentence from the
“Mainichi Shinbun” data-set. Second, we calculated
the similarity between the extracted sentence and the
sentences in the “Yomiuri Shinbun” data-set. Third,
we ranked the sentences in the “Yomiuri Shinbun”
in descending order based on the calculated similar-
ity values. Finally, we evaluated performance as a
similarity measure using the MRR measure.
Table 4 shows the results of this experiment.
2http://www.nlp.cs.ritsumei.ac.jp/qac/
3http://www.cs.nyu.edu/˜sekine/PROJECT/CRLQA/
Table 4: Results of the performance as a similarity
measure for sentence alignment

n      1     2     3     4     5     6
HDAG   -   .523  .484  .467  .442  .423
DAG    -   .503  .478  .461  .439  .420
DSK'   -   .174  .083  .035  .020  .021
SSK'   -   .479  .444  .422  .412  .398
BOA   .394
BOA'  .451
Table 5: Results of question classification by SVM
with comparison kernel functions

n           1     2     3     4     5     6
HDAG        -   .862  .865  .866  .864  .865
DAG         -   .862  .862  .847  .818  .751
DSK'        -   .731  .595  .473  .412  .390
SSK'        -   .850  .847  .825  .777  .725
BOA+poly   .810  .823  .800  .753  .692  .625
BOA'+poly  .807  .807  .742  .666  .558  .468
4.2 Performance as a Kernel Function
Question Classification
We evaluated the performance of the comparison
methods as kernel functions in a machine learning
approach to question classification. We chose SVM
as a kernel-based learning algorithm that produces
state-of-the-art performance in several NLP tasks.
We used the same data set as in the previous
experiments, with the following difference: if a
question type had fewer than ten questions, we
moved its entries into the upper question type, as
defined in the CRL-QA data, to provide enough
training samples for each question type. We used
one-vs-rest as the multi-class classification
method and selected the highest-scoring question
type. In the cases of BOA and BOA', we used the
polynomial kernel (Vapnik, 1995) to consider
attribute combinations.
Table 5 shows the average accuracy of each ques-
tion as evaluated by 5-fold cross validation.
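With a modern toolkit, this setup can be reproduced
by precomputing the Gram matrix of the chosen
kernel and handing it to an SVM. The following
scikit-learn sketch is our own illustration, with a
trivial stand-in kernel so that it runs end to end;
in the actual experiments the normalized HDAG
Kernel of equations (13)-(14) would take its place:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def gram(xs, ys, kernel):
        # precomputed Gram matrix between two lists of examples
        return np.array([[kernel(a, b) for b in ys] for a in xs])

    def toy_kernel(a, b):
        # stand-in for the normalized HDAG Kernel: attribute overlap
        return float(len(set(a) & set(b)))

    train = [("who", "PERSON"), ("where", "LOC"), ("who", "NNP")]
    y_train = ["person", "location", "person"]
    test = [("who", "NP")]

    clf = OneVsRestClassifier(SVC(kernel="precomputed"))
    clf.fit(gram(train, train, toy_kernel), y_train)
    print(clf.predict(gram(test, train, toy_kernel)))  # expected: ['person']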
5 Discussion
The experiments in this paper were designed to eval-
uate how well the similarity measure reflects the seman-
tic information of texts. In the task of Question Clas-
sification, a given question is classified into Ques-
tion Type, which reflects the intention of the ques-
tion. The Sentence Alignment task evaluates which
sentence is the most semantically similar to a given
sentence.
The HDAG Kernel showed the best performance
in the experiments as a similarity measure and as
a kernel of the learning algorithm. This proves the
usefulness of the HDAG Kernel in determining the
similarity measure of texts and in providing an SVM
kernel for resolving classification problems in NLP
tasks. These results indicate that our approach, in-
corporating richer structures within texts, is well
suited to tasks that require evaluation of the
semantic similarity between texts. The HDAG Kernel
has wide potential application in NLP tasks, and we
believe it will be adopted in other practical NLP
applications such as Text Categorization and
Question Answering.
Our experiments indicate that the optimal
parameters of combination number n and decay factor
λ depend on the task at hand. They can be
determined by experiments.
The original DSK requires exact matching of the
tree structure, even when expanded (DSK') for
flexible matching, which is why DSK' showed the
worst performance. Moreover, in the sentence
alignment task, paraphrases and different
expressions with the same meaning are common, and
the structures of the parse trees in general differ
widely. Unlike DSK', SSK' and the HDAG Kernel offer
approximate matching, which produces better
performance.
The structure of HDAG reduces to that of DAG if we
do not consider the hierarchical structure. In
addition, the structure of sequences (strings) is
entirely included in that of DAGs. Thus, the
framework of the HDAG Kernel covers the DAG Kernel
and SSK.
6 Conclusion
This paper proposed the HDAG Kernel, which can
reflect the richer information present within texts.
Our proposed method is a very generalized frame-
work for handling the structure inside a text.
We evaluated the performance of the HDAG Ker-
nel both as a similarity measure and as a kernel func-
tion. Our experiments showed that HDAG Kernel
offers better performance than SSK, DSK, and the
baseline method of the Cosine measure for feature
vectors, because HDAG Kernel better utilizes the
richer structure present within texts.
References
M. Collins and N. Duffy. 2001. Parsing with a Single
Neuron: Convolution Kernels for Natural Language
Problems. In Technical Report UCSC-CRL-01-10. UC
Santa Cruz.
N. Cristianini and J. Shawe-Taylor. 2000. An In-
troduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge Univer-
sity Press.
D. Haussler. 1999. Convolution Kernels on Discrete
Structures. In Technical Report UCSC-CRL-99-10. UC
Santa Cruz.
T. Hirao, H. Kazawa, H. Isozaki, E. Maeda, and Y. Mat-
sumoto. 2003. Machine Learning Approach to Multi-
Document Summarization. Journal of Natural Lan-
guage Processing, 10(1):81–108. (in Japanese).
S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo,
H. Nakaiwa, K. Ogura, Y. Oyama, and Y. Hayashi,
editors. 1997. The Semantic Attribute System, Goi-
Taikei — A Japanese Lexicon, volume 1. Iwanami
Publishing. (in Japanese).
H. Isozaki and H. Kazawa. 2002. Efficient Support
Vector Classifiers for Named Entity Recognition. In
Proc. of the 19th International Conference on Compu-
tational Linguistics (COLING 2002), pages 390–396.
T. Kudo and Y. Matsumoto. 2002. Japanese Depen-
dency Analysis using Cascaded Chunking. In Proc.
of the 6th Conference on Natural Language Learning
(CoNLL 2002), pages 63–69.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini,
and C. Watkins. 2002. Text Classification Using
String Kernels. Journal of Machine Learning Research,
2:419–444.
G. Salton, A. Wong, and C. Yang. 1975. A Vector Space
Model for Automatic Indexing. Communications of the
ACM, 18(11):613–620.
V. N. Vapnik. 1995. The Nature of Statistical Learning
Theory. Springer.
E. M. Voorhees and D. M. Tice. 1999. The TREC-8
Question Answering Track Evaluation. Proc. of the
8th Text Retrieval Conference (TREC-8).
