Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 97–100,
New York, June 2006. c©2006 Association for Computational Linguistics
Syntactic Kernels for Natural Language Learning:
the Semantic Role Labeling Case
Alessandro Moschitti
Department of Computer Science
University of Rome ”Tor Vergata”
Rome, Italy
moschitti@info.uniroma2.it
Abstract
In this paper, we use tree kernels to exploit
deep syntactic parsing information for nat-
ural language applications. We study the
properties of different kernels and we pro-
vide algorithms for their computation in
linear average time. The experiments with
SVMs on the task of predicate argument
classification provide empirical data that
validates our methods.
1 Introduction
Recently, several tree kernels have been applied to
natural language learning, e.g. (Collins and Duffy,
2002; Zelenko et al., 2003; Cumby and Roth, 2003;
Culotta and Sorensen, 2004; Moschitti, 2004). De-
spite their promising results, three general objec-
tions against kernel methods are raised: (1) only a
subset of the dual space features are relevant, thus,
it may be possible to design features in the primal
space that produce the same accuracy with a faster
computation time; (2) in some cases the high num-
ber of features (substructures) of the dual space can
produce overfitting with a consequent accuracy de-
crease (Cumby and Roth, 2003); and (3) the compu-
tation time of kernel functions may be too high and
prevent their application in real scenarios.
In this paper, we study the impact of the sub-
tree (ST) (Vishwanathan and Smola, 2002), subset
tree (SST) (Collins and Duffy, 2002) and partial tree
(PT) kernels on Semantic Role Labeling (SRL). The
PT kernel is a new function that we have designed
to generate larger substructure spaces. Moreover,
to solve the computation problems, we propose al-
gorithms which evaluate the above kernels in linear
average running time.
We experimented such kernels with Support Vec-
tor Machines (SVMs) on the classification of seman-
tic roles of PropBank (Kingsbury and Palmer, 2002)
and FrameNet (Fillmore, 1982) data sets. The re-
sults show that: (1) the kernel approach provides the
same accuracy of the manually designed features.
(2) The overfitting problem does not occur although
the richer space of PTs does not provide better ac-
curacy than the one based on SST. (3) The average
running time of our tree kernel computation is linear.
In the remainder of this paper, Section 2 intro-
duces the different tree kernel spaces. Section 3 de-
scribes the kernel functions and our fast algorithms
for their evaluation. Section 4 shows the compara-
tive performance in terms of execution time and ac-
curacy.
2 Tree kernel Spaces
We consider three different tree kernel spaces: the
subtrees (STs), the subset trees (SSTs) and the novel
partial trees (PTs).
An ST of a tree is rooted in any node and includes
all its descendants. For example, Figure 1 shows the
parse tree of the sentence "Mary brought a cat"
together with its 6 STs. An SST is a more general
structure since its leaves can be associated with non-
terminal symbols. The SSTs satisfy the constraint
that grammatical rules cannot be broken. For exam-
ple, Figure 2 shows 10 SSTs out of 17 of the sub-
tree of Figure 1 rooted in VP. If we relax the non-
breaking rule constraint we obtain a more general
form of substructures, i.e. the PTs. For example,
97
Figure 3 shows 10 out of the total 30 PTs, derived
from the same tree as before.
S 
N 
NP 
D N 
VP 
V Mary 
brought 
a    cat 
NP 
D N 
a    cat 
N 
   cat 
D 
a 
V 
brought 
N 
Mary 
NP 
D N 
VP 
V 
brought 
a    cat 
Figure 1: A syntactic parse tree with its subtrees (STs).
NP 
D N 
a   cat 
NP 
D N 
NP 
D N 
a 
NP 
D N NP 
D N 
VP 
V 
brought 
a    cat 
  cat NP D N 
VP 
V 
a    cat 
NP 
D N 
VP 
V 
N 
   cat 
D 
a 
V 
brought 
N 
Mary … 
Figure 2: A tree with some of its subset trees (SSTs).
Figure 3: A tree with some of its partial trees (PTs).
3 Fast Tree Kernel Functions
The main idea of tree kernels is to compute the
number of common substructures between two trees
T1 and T2 without explicitly considering the whole
fragment space. We designed a general function
to compute the ST, SST and PT kernels. Our fast
algorithm is inspired by the efficient evaluation of
non-continuous subsequences (described in (Shawe-
Taylor and Cristianini, 2004)). To further increase
the computation speed, we also applied the pre-
selection of node pairs which have non-null kernel.
3.1 Generalized Tree Kernel function
Given a tree fragment space F = {f1,f2,..,fF}, we
use the indicator function Ii(n) which is equal to 1 if
the target fi is rooted at node n and 0 otherwise. We
define the general kernel as:
K(T1,T2) =
summationdisplay
n1∈NT1
summationdisplay
n2∈NT2
∆(n1,n2), (1)
where NT1 and NT2 are the sets of nodes in T1 and
T2, respectively and ∆(n1,n2) = summationtext|F|i=1 Ii(n1)Ii(n2),
i.e. the number of common fragments rooted at the
n1 and n2 nodes. We can compute it as follows:
- if the node labels of n1 and n2 are different then
∆(n1,n2) = 0;
- else:
∆(n1,n2) = 1+
summationdisplay
vectorJ1,vectorJ2,l(vectorJ1)=l(vectorJ2)
l(vectorJ1)productdisplay
i=1
∆(cn1[vectorJ1i],cn2[vectorJ2i])
(2)
where vectorJ1 = 〈J11,J12,J13,..〉 and vectorJ2 = 〈J21,J22,J23,..〉
are index sequences associated with the ordered
child sequences cn1 of n1 and cn2 of n2, respectively,
vectorJ1i and vectorJ2i point to the i-th children in the two se-
quences, and l(·) returns the sequence length. We
note that (1) Eq. 2 is a convolution kernel accord-
ing to the definition and the proof given in (Haus-
sler, 1999). (2) Such kernel generates a feature
space richer than those defined in (Vishwanathan
and Smola, 2002; Collins and Duffy, 2002; Zelenko
et al., 2003; Culotta and Sorensen, 2004; Shawe-
Taylor and Cristianini, 2004). Additionally, we add
the decay factor as follows: ∆(n1,n2) =
µ
parenleftBig
λ2+
summationdisplay
vectorJ1,vectorJ2,l(vectorJ1)=l(vectorJ2)
λd(vectorJ1)+d(vectorJ2)
l(vectorJ1)productdisplay
i=1
∆(cn1[vectorJ1i],cn2[vectorJ2i])
parenrightBig
(3)
where d(vectorJ1) = vectorJ1l(vectorJ1) − vectorJ11 and d(vectorJ2) = vectorJ2l(vectorJ2) − vectorJ21.
In this way, we penalize subtrees built on child
subsequences that contain gaps. Moreover, to
have a similarity score between 0 and 1, we also
apply the normalization in the kernel space, i.e.
Kprime(T1,T2) = K(T1,T2)√K(T
1,T1)×K(T2,T2)
. As the summation
in Eq. 3 can be distributed with respect to different
types of sequences, e.g. those composed by p
children, it follows that
∆(n1,n2) = µ
parenleftbig
λ2 +summationtextlmp=1 ∆p(n1,n2)
parenrightbig
, (4)
where ∆p evaluates the number of common subtrees
rooted in subsequences of exactly p children (of n1
and n2) and lm = min{l(cn1),l(cn2)}. Note also that if
we consider only the contribution of the longest se-
quence of node pairs that have the same children, we
implement the SST kernel. For the STs computation
we need also to remove the λ2 term from Eq. 4.
Given the two child sequences c1a = cn1 and
c2b = cn2 (a and b are the last children), ∆p(c1a,c2b) =
∆(a,b)×
|c1|summationdisplay
i=1
|c2|summationdisplay
r=1
λ|c1|−i+|c2|−r ×∆p−1(c1[1 : i],c2[1 : r]),
where c1[1 : i] and c2[1 : r] are the child subse-
quences from 1 to i and from 1 to r of c1 and c2. If
we name the double summation term as Dp, we can
rewrite the relation as:
98
∆p(c1a,c2b) =
braceleftBigg ∆(a,b)D
p(|c1|,|c2|) if a = b;
0 otherwise.
Note that Dp satisfies the recursive relation:
Dp(k,l) = ∆p−1(s[1 : k],t[1 : l])+λDp(k,l−1)
+λDp(k−1,l)+λ2Dp(k−1,l−1).
By means of the above relation, we can compute
the child subsequences of two sets c1 and c2 in
O(p|c1||c2|). This means that the worst case com-
plexity of the PT kernel is O(pρ2|NT1||NT2|), where
ρ is the maximum branching factor of the two trees.
Note that the average ρ in natural language parse
trees is very small and the overall complexity can be
reduced by avoiding the computation of node pairs
with different labels. The next section shows our fast
algorithm to find non-null node pairs.
3.2 Fast non-null node pair computation
To compute the kernels defined in the previous sec-
tion, we sum the ∆ function for each pair 〈n1,n2〉∈
NT1 × NT2 (Eq. 1). When the labels associated
with n1 and n2 are different, we can avoid evaluating
∆(n1,n2) since it is 0. Thus, we look for a node pair
set Np ={〈n1,n2〉∈ NT1 ×NT2 : label(n1) = label(n2)}.
To efficiently build Np, we (i) extract the L1 and
L2 lists of nodes from T1 and T2, (ii) sort them in
alphanumeric order and (iii) scan them to find Np.
Step (iii) may require only O(|NT1|+|NT2|) time, but,
if label(n1) appears r1 times in T1 and label(n2) is re-
peated r2 times in T2, we need to consider r1 × r2
pairs. The formal can be found in (Moschitti, 2006).
4 The Experiments
In these experiments, we study tree kernel perfor-
mance in terms of average running time and accu-
racy on the classification of predicate arguments. As
shown in (Moschitti, 2004), we can label seman-
tic roles by classifying the smallest subtree that in-
cludes the predicate with one of its arguments, i.e.
the so called PAF structure.
The experiments were carried out with
the SVM-light-TK software available at
http://ai-nlp.info.uniroma2.it/moschitti/
which encodes the fast tree kernels in the SVM-light
software (Joachims, 1999). The multiclassifiers
were obtained by training an SVM for each class
in the ONE-vs.-ALL fashion. In the testing phase,
we selected the class associated with the maximum
SVM score.
For the ST, SST and PT kernels, we found that the
best λ values (see Section 3) on the development set
were 1, 0.4 and 0.8, respectively, whereas the best µ
was 0.4.
4.1 Kernel running time experiments
To study the FTK running time, we extracted from
the Penn Treebank several samples of 500 trees con-
taining exactly n nodes. Each point of Figure 4
shows the average computation time1 of the kernel
function applied to the 250,000 pairs of trees of size
n. It clearly appears that the FTK-SST and FTK-PT
(i.e. FTK applied to the SST and PT kernels) av-
erage running time has linear behavior whereas, as
expected, the na¨ıve SST algorithm shows a quadratic
curve.
Figure 4: Average time in µseconds for the na¨ıve SST kernel,
FTK-SST and FTK-PT evaluations.
4.2 Experiments on SRL dataset
We used two different corpora: PropBank
(www.cis.upenn.edu/∼ace) along with Penn
Treebank 2 (Marcus et al., 1993) and FrameNet.
PropBank contains about 53,700 sentences and
a fixed split between training and testing used in
other researches. In this split, sections from 02 to
21 are used for training, section 23 for testing and
section 22 as development set. We considered a
total of 122,774 and 7,359 arguments (from Arg0
to Arg5, ArgA and ArgM) in training and testing,
respectively. The tree structures were extracted
from the Penn Treebank.
From the FrameNet corpus (www.icsi.
berkeley.edu/∼framenet) we extracted all
1We run the experiments on a Pentium 4, 2GHz, with 1 Gb
ram.
99
Figure 5: Multiclassifier accuracy according to different train-
ing set percentage.
24,558 sentences of the 40 Frames selected for
the Automatic Labeling of Semantic Roles task of
Senseval 3 (www.senseval.org). We considered
the 18 most frequent roles, for a total of 37,948
examples (30% of the sentences for testing and
70% for training/validation). The sentences were
processed with the Collins’ parser (Collins, 1997)
to generate automatic parse trees.
We run ST, SST and PT kernels along with
the linear kernel of standard features (Carreras and
M`arquez, 2005) on PropBank training sets of dif-
ferent size. Figure 5 illustrates the learning curves
associated with the above kernels for the SVM mul-
ticlassifiers.
The tables 1 and 2 report the results, using all
available training data, on PropBank and FrameNet
test sets, respectively. We note that: (1) the accu-
racy of PTs is almost equal to the one produced by
SSTs as the PT space is a hyperset of SSTs. The
small difference is due to the poor relevance of the
substructures in the PT − SST set, which degrade
the PT space. (2) The high F1 measures of tree ker-
nels on FrameNet suggest that they are robust with
respect to automatic parse trees.
Moreover, the learning time of SVMs using FTK
for the classification of one large argument (Arg 0)
is much lower than the one required by na¨ıve algo-
rithm. With all the training data FTK terminated in
6 hours whereas the na¨ıve approach required more
than 1 week. However, the complexity burden of
working in the dual space can be alleviated with re-
cent approaches proposed in (Kudo and Matsumoto,
2003; Suzuki et al., 2004).
Finally, we carried out some experiments with the
combination between linear and tree kernels and we
found that tree kernels improve the models based on
manually designed features by 2/3 percent points,
thus they can be seen as a useful tactic to boost sys-
tem accuracy.
Args Linear ST SST PT
Acc. 87.6 84.6 87.7 86.7
Table 1: Evaluation of kernels on PropBank data and gold
parse trees.
Roles Linear ST SST PT
Acc. 82.3 80.0 81.2 79.9
Table 2: Evaluation of kernels on FrameNet data encoded in
automatic parse trees.
References
Xavier Carreras and Llu´ıs M`arquez. 2005. Introduction to the
CoNLL-2005 shared task: Semantic role labeling. In Pro-
ceedings of CoNLL05.
Michael Collins and Nigel Duffy. 2002. New ranking algo-
rithms for parsing and tagging: Kernels over discrete struc-
tures, and the voted perceptron. In ACL02.
Michael Collins. 1997. Three generative, lexicalized models
for statistical parsing. In Proceedings of the ACL97.
Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree
kernels for relation extraction. In Proceedings of ACL04.
Chad Cumby and Dan Roth. 2003. Kernel methods for rela-
tional learning. In Proceedings of ICML03.
Charles J. Fillmore. 1982. Frame semantics. In Linguistics in
the Morning Calm.
D. Haussler. 1999. Convolution kernels on discrete struc-
tures. Technical report ucs-crl-99-10, University of Califor-
nia Santa Cruz.
T. Joachims. 1999. Making large-scale SVM learning practical.
In B. Sch¨olkopf, C. Burges, and A. Smola, editors, Advances
in Kernel Methods - Support Vector Learning.
Paul Kingsbury and Martha Palmer. 2002. From Treebank to
PropBank. In Proceedings of LREC02.
Taku Kudo and Yuji Matsumoto. 2003. Fast methods for
kernel-based text analysis. In Proceedings of ACL03.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993.
Building a large annotated corpus of english: The Penn Tree-
bank. Computational Linguistics.
Alessandro Moschitti. 2004. A study on convolution kernels
for shallow semantic parsing. In proceedings of ACL04.
Alessandro Moschitti. 2006. Making tree kernels practical for
natural language learning. In Proceedings of EACL06.
John Shawe-Taylor and Nello Cristianini. 2004. Kernel Meth-
ods for Pattern Analysis. Cambridge University Press.
Jun Suzuki, Hideki Isozaki, and Eisaku Maeda. 2004. Con-
volution kernels with feature selection for natural language
processing tasks. In Proceedings of ACL04.
S.V.N. Vishwanathan and A.J. Smola. 2002. Fast kernels on
strings and trees. In Proceedings of NIPS02.
D. Zelenko, C. Aone, and A. Richardella. 2003. Kernel meth-
ods for relation extraction. JMLR.
100
