Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 825–832,
Sydney, July 2006. c©2006 Association for Computational Linguistics
A Composite Kernel to Extract Relations between Entities with 
both Flat and Structured Features 
Min Zhang         Jie Zhang       Jian Su      Guodong Zhou 
Institute for Infocomm Research 
21 Heng Mui Keng Terrace, Singapore 119613 
{mzhang, zhangjie, sujian, zhougd}@i2r.a-star.edu.sg 
 
Abstract 
This paper proposes a novel composite ker-
nel for relation extraction. The composite 
kernel consists of two individual kernels: an 
entity kernel that allows for entity-related 
features and a convolution parse tree kernel 
that models syntactic information of relation 
examples. The motivation of our method is 
to fully utilize the nice properties of kernel 
methods to explore diverse knowledge for 
relation extraction. Our study illustrates that 
the composite kernel can effectively capture 
both flat and structured features without the 
need for extensive feature engineering, and 
can also easily scale to include more fea-
tures. Evaluation on the ACE corpus shows 
that our method outperforms the previous 
best-reported methods and significantly out-
performs previous two dependency tree ker-
nels for relation extraction. 
1 Introduction 
The goal of relation extraction is to find various 
predefined semantic relations between pairs of 
entities in text. The research on relation extrac-
tion has been promoted by the Message Under-
standing Conferences (MUCs) (MUC, 1987-
1998) and Automatic Content Extraction (ACE) 
program (ACE, 2002-2005). According to the 
ACE Program, an entity is an object or set of ob-
jects in the world and a relation is an explicitly 
or implicitly stated relationship among entities. 
For example, the sentence “Bill Gates is chair-
man and chief software architect of Microsoft 
Corporation.” conveys the ACE-style relation 
“EMPLOYMENT.exec” between the entities 
“Bill Gates” (PERSON.Name) and “Microsoft 
Corporation” (ORGANIZATION. Commercial).  
In this paper, we address the problem of rela-
tion extraction using kernel methods (Schölkopf 
and Smola, 2001). Many feature-based learning 
algorithms involve only the dot-product between 
feature vectors. Kernel methods can be regarded 
as a generalization of the feature-based methods 
by replacing the dot-product with a kernel func-
tion between two vectors, or even between two 
objects. A kernel function is a similarity function 
satisfying the properties of being symmetric and 
positive-definite. Recently, kernel methods are 
attracting more interests in the NLP study due to 
their ability of implicitly exploring huge amounts 
of structured features using the original represen-
tation of objects. For example, the kernels for 
structured natural language data, such as parse 
tree kernel (Collins and Duffy, 2001), string ker-
nel (Lodhi et al., 2002) and graph kernel (Suzuki 
et al., 2003) are example instances of the well-
known convolution kernels
1
 in NLP. In relation 
extraction, typical work on kernel methods in-
cludes: Zelenko et al. (2003), Culotta and Soren-
sen (2004) and Bunescu and Mooney (2005). 
This paper presents a novel composite kernel 
to explore diverse knowledge for relation extrac-
tion. The composite kernel consists of an entity 
kernel and a convolution parse tree kernel. Our 
study demonstrates that the composite kernel is 
very effective for relation extraction. It also 
shows without the need for extensive feature en-
gineering the composite kernel can not only cap-
ture most of the flat features used in the previous 
work but also exploit the useful syntactic struc-
ture features effectively. An advantage of our 
method is that the composite kernel can easily 
cover more knowledge by introducing more ker-
nels. Evaluation on the ACE corpus shows that 
our method outperforms the previous best-
reported methods and significantly outperforms 
the previous kernel methods due to its effective 
exploration of various syntactic features. 
The rest of the paper is organized as follows. 
In Section 2, we review the previous work. Sec-
tion 3 discusses our composite kernel. Section 4 
reports the experimental results and our observa-
tions. Section 5 compares our method with the 
                                                 
1
 Convolution kernels were proposed for a discrete structure 
by Haussler (1999) in the machine learning field. This 
framework defines a kernel between input objects by apply-
ing convolution “sub-kernels” that are the kernels for the 
decompositions (parts) of the objects.  
825
previous work from the viewpoint of feature ex-
ploration. We conclude our work and indicate the 
future work in Section 6. 
2 Related Work 
Many techniques on relation extraction, such as 
rule-based (MUC, 1987-1998; Miller et al., 
2000), feature-based (Kambhatla 2004; Zhou et 
al., 2005) and kernel-based (Zelenko et al., 2003; 
Culotta and Sorensen, 2004; Bunescu and 
Mooney, 2005), have been proposed in the litera-
ture. 
Rule-based methods for this task employ a 
number of linguistic rules to capture various rela-
tion patterns. Miller et al. (2000) addressed the 
task from the syntactic parsing viewpoint and 
integrated various tasks such as POS tagging, NE 
tagging, syntactic parsing, template extraction 
and relation extraction using a generative model. 
Feature-based methods (Kambhatla, 2004; 
Zhou et al., 2005; Zhao and Grishman, 2005
2
) 
for this task employ a large amount of diverse 
linguistic features, such as lexical, syntactic and 
semantic features. These methods are very effec-
tive for relation extraction and show the best-
reported performance on the ACE corpus. How-
ever, the problems are that these diverse features 
have to be manually calibrated and the hierarchi-
cal structured information in a parse tree is not 
well preserved in their parse tree-related features, 
which only represent simple flat path informa-
tion connecting two entities in the parse tree 
through a path of non-terminals and a list of base 
phrase chunks. 
Prior kernel-based methods for this task focus 
on using individual tree kernels to exploit tree 
structure-related features. Zelenko et al. (2003) 
developed a kernel over parse trees for relation 
extraction. The kernel matches nodes from roots 
to leaf nodes recursively layer by layer in a top-
down manner. Culotta and Sorensen (2004) gen-
eralized it to estimate similarity between depend-
ency trees. Their tree kernels require the match-
able nodes to be at the same layer counting from 
the root and to have an identical path of ascend-
ing nodes from the roots to the current nodes. 
The two constraints make their kernel high preci-
sion but very low recall on the ACE 2003 corpus. 
Bunescu and Mooney (2005) proposed another 
dependency tree kernel for relation extraction. 
                                                 
2
 We classify the feature-based kernel defined in (Zhao and 
Grishman, 2005) into the feature-based methods since their 
kernels can be easily represented by the dot-products be-
tween explicit feature vectors. 
Their kernel simply counts the number of com-
mon word classes at each position in the shortest 
paths between two entities in dependency trees. 
The kernel requires the two paths to have the 
same length; otherwise the kernel value is zero. 
Therefore, although this kernel shows perform-
ance improvement over the previous one (Culotta 
and Sorensen, 2004), the constraint makes the 
two dependency kernels share the similar behav-
ior: good precision but much lower recall on the 
ACE corpus. 
The above discussion shows that, although 
kernel methods can explore the huge amounts of 
implicit (structured) features, until now the fea-
ture-based methods enjoy more success. One 
may ask: how can we make full use of the nice 
properties of kernel methods and define an effec-
tive kernel for relation extraction? 
In this paper, we study how relation extraction 
can benefit from the elegant properties of kernel 
methods: 1) implicitly exploring (structured) fea-
tures in a high dimensional space; and 2) the nice 
mathematical properties, for example, the sum, 
product, normalization and polynomial expan-
sion of existing kernels is a valid kernel 
(Schölkopf and Smola, 2001). We also demon-
strate how our composite kernel effectively cap-
tures the diverse knowledge for relation extrac-
tion.  
3 Composite Kernel for Relation Ex-
traction  
In this section, we define the composite kernel 
and study the effective representation of a rela-
tion instance. 
3.1 Composite Kernel 
Our composite kernel consists of an entity kernel 
and a convolution parse tree kernel. To our 
knowledge, convolution kernels have not been 
explored for relation extraction. 
 
(1) Entity Kernel: The ACE 2003 data defines 
four entity features: entity headword, entity type 
and subtype (only for GPE), and mention type 
while the ACE 2004 data makes some modifica-
tions and introduces a new feature “LDC men-
tion type”. Our statistics on the ACE data reveals 
that the entity features impose a strong constraint 
on relation types. Therefore, we design a linear 
kernel to explicitly capture such features: 
12 1 2
1,2
(, ) (., .)
LEi
i
KRR KRERE
=
=
∑
 (1) 
where
1
R and
2
R stands for two relation instances, 
E
i
 means the i
th
 entity of a relation instance, and 
826
(,)
E
K ••  is a simple kernel function over the fea-
tures of entities: 
12 1 2
(, ) (., .)
Eii
i
KEE CEfEf=
∑
(2) 
where 
i
f represents the i
th
 entity feature, and the 
function (,)C ••  returns 1 if the two feature val-
ues are identical and 0 otherwise. (,)
E
K • •  re-
turns the number of feature values in common of 
two entities. 
 
(2) Convolution Parse Tree Kernel: A convo-
lution kernel aims to capture structured informa-
tion in terms of substructures. Here we use the 
same convolution parse tree kernel as described 
in Collins and Duffy (2001) for syntactic parsing 
and Moschitti (2004) for semantic role labeling. 
Generally, we can represent a parse tree T  by a 
vector of integer counts of each sub-tree type 
(regardless of its ancestors):  
 
()Tφ =(# subtree
1
(T), …, # subtree
i
(T), …,  # 
subtree
n
(T) ) 
 
where # subtree
i
(T) is the occurrence number of 
the i
th
 sub-tree type (subtree
i
) in T. Since the 
number of different sub-trees is exponential with 
the parse tree size, it is computationally infeasi-
ble to directly use the feature vector ()Tφ . To 
solve this computational issue, Collins and Duffy 
(2001) proposed the following parse tree kernel 
to calculate the dot product between the above 
high dimensional vectors implicitly. 
 
 
11 2 2
11 2 2
12 1 2
12
12
(, ) (),( )
   #()#()
   () ()
   (, )
()
ii
i
subtree subtree
inN nN
nN nN
KTT T T
subtree T subtree T
In In
nn
φ φ
∈∈
∈∈
=< >
=
=
=∆
⋅
⋅
∑
∑∑ ∑
∑∑
(3) 
 
where N
1
 and N
2
 are the sets of nodes in trees T
1
 
and T
2
, respectively, and ()
i
subtree
In is a function 
that is 1 iff the subtree
i
 occurs with root at node n 
and zero otherwise, and 
12
(, )nn∆  is the number of 
the common subtrees rooted at n
1
 and n
2
, i.e. 
 
12 1 2
(, ) () ()
ii
subtree subtree
i
nn I n I n∆= ⋅
∑
 
 
12
(, )nn∆ can be computed by the following recur-
sive rules:  
(1) if the productions (CFP rules) at 
1
n  and 
2
n  
are different, 
12
(, ) 0nn∆=
; 
(2) else if both
1
n  and 
2
n  are pre-terminals (POS 
tags), 
12
(, )1nn λ∆=×; 
(3) else, 
1
()
12 1 2
1
(, ) (1 ( (,), (,))
nc n
j
n n ch n j ch n jλ
=
∆= +∆
∏
,  
where 
1
()nc n is the child number of 
1
n , ch(n,j) is 
the j
th
 child of node n  andλ (0<λ <1) is the de-
cay factor in order to make the kernel value less 
variable with respect to the subtree sizes. In ad-
dition, the recursive rule (3) holds because given 
two nodes with the same children, one can con-
struct common sub-trees using these children and 
common sub-trees of further offspring.  
The parse tree kernel counts the number of 
common sub-trees as the syntactic similarity 
measure between two relation instances. The 
time complexity for computing this kernel 
is
12
(| | | |)ON N⋅ . 
In this paper, two composite kernels are de-
fined by combing the above two individual ker-
nels in the following ways: 
 
1) Linear combination: 
 
112 12 12
ˆˆ
(, ) (, )(1 ) (,)
L
KRR K RR KTTαα••=+−(4) 
 
Here, 
ˆ
(,)K • •  is the normalized
3
(,)K •• and α  
is the coefficient. Evaluation on the development 
set shows that this composite kernel yields the 
best performance when α is set to 0.4. 
 
2) Polynomial expansion: 
 
212 12 12
ˆˆ
(, ) (, )(1 ) (,)
P
L
KRR K RR KTTαα••=+−(5) 
 
Here, 
ˆ
(,)K • •  is the normalized (,)K •• , (,)
p
K • •  
is the polynomial expansion of (,)K ••  with de-
gree d=2, i.e.
2
(,) ( (,) 1)
p
KK•• ••=+, and α  is the 
coefficient. Evaluation on the development set 
shows that this composite kernel yields the best 
performance when α is set to 0.23. 
The polynomial expansion aims to explore the 
entity bi-gram features, esp. the combined fea-
tures from the first and second entities, respec-
tively. In addition, due to the different scales of 
the values of the two individual kernels, they are 
normalized before combination. This can avoid 
one kernel value being overwhelmed by that of 
another one.  
The entity kernel formulated by eqn. (1) is a 
proper kernel since it simply calculates the dot 
product of the entity feature vectors. The tree 
kernel formulated by eqn. (3) is proven to be a 
proper kernel (Collins and Duffy, 2001). Since 
kernel function set is closed under normalization, 
polynomial expansion and linear combination 
(Schölkopf and Smola, 2001), the two composite 
kernels are also proper kernels. 
                                                 
3
 A kernel (, )K x y  can be normalized by dividing it by 
(,) (, )K xx Kyy•
.  
827
3.2 Relation Instance Spaces 
A relation instance is encapsulated by a parse 
tree. Thus, it is critical to understand which por-
tion of a parse tree is important in the kernel cal-
culation. We study five cases as shown in Fig.1. 
 
(1) Minimum Complete Tree (MCT): the com-
plete sub-tree rooted by the nearest common an-
cestor of the two entities under consideration. 
 
(2) Path-enclosed Tree (PT): the smallest com-
mon sub-tree including the two entities. In other 
words, the sub-tree is enclosed by the shortest 
path linking the two entities in the parse tree (this 
path is also commonly-used as the path tree fea-
ture in the feature-based methods). 
 
(3) Context-Sensitive Path Tree (CPT): the PT 
extended with the 1
st
 left word of entity 1 and the 
1
st
 right word of entity 2. 
 
(4) Flattened Path-enclosed Tree (FPT): the 
PT with the single in and out arcs of non-
terminal nodes (except POS nodes) removed. 
 
(5) Flattened CPT (FCPT): the CPT with the 
single in and out arcs of non-terminal nodes (ex-
cept POS nodes) removed.  
 
Fig. 1 illustrates different representations of an 
example relation instance. T
1
 is MCT for the 
relation instance, where the sub-tree circled by a 
dashed line is PT, which is also shown in T
2 
for 
clarity. The only difference between MCT and 
PT lies in that MCT does not allow partial pro-
duction rules (for example, NP �PP is a partial 
production rule while NP �NP+PP is an entire 
production rule in the top of T
2
). For instance, 
only the most-right child in the most-left sub-tree 
[NP [CD 200] [JJ domestic] [E1-PER …]] of T
1 
is kept in T
2
. By comparing the performance of 
T
1
 and T
2,
 we can evaluate the effect of sub-trees 
with partial production rules as shown in T
2 
and 
the necessity of keeping the whole left and right 
context sub-trees as shown in T
1 
in relation ex-
traction. T
3
 is CPT, where the two sub-trees cir-
cled by dashed lines are included as the context 
to T
2 
and make T
3 
context-sensitive. This is to 
evaluate whether the limited context information 
in CPT can boost performance. FPT in T
4
 is 
formed by removing the two circled nodes in T
2
. 
This is to study whether and how the elimination 
of single non-terminal nodes affects the perform-
ance of relation extraction.  
T
1
): MCT 
T
2
): PT 
T
3
):CPT 
T
4
): FPT 
Figure 1. Different representations of a relation instance in the example sentence “…provide bene-
fits to 200 domestic partners of their own workers in New York”, where the phrase type 
“E1-PER” denotes that the current node is the 1
st
 entity with type “PERSON”, and like-
wise for the others. The relation instance is excerpted from the ACE 2003 corpus, where 
a relation “SOCIAL.Other-Personal” exists between entities “partners” (PER) and 
“workers” (PER). We use Charniak’s parser (Charniak, 2001) to parse the example sen-
tence. To save space, the FCPT is not shown here. 828
4 Experiments 
4.1 Experimental Setting 
Data: We use the English portion of both the 
ACE 2003 and 2004 corpora from LDC in our 
experiments. In the ACE 2003 data, the training 
set consists of 674 documents and 9683 relation 
instances while the test set consists of 97 docu-
ments and 1386 relation instances. The ACE 
2003 data defines 5 entity types, 5 major relation 
types and 24 relation subtypes. The ACE 2004 
data contains 451 documents and 5702 relation 
instances. It redefines 7 entity types, 7 major re-
lation types and 23 subtypes. Since Zhao and 
Grishman (2005) use a 5-fold cross-validation on 
a subset of the 2004 data (newswire and broad-
cast news domains, containing 348 documents 
and 4400 relation instances), for comparison, we 
use the same setting (5-fold cross-validation on 
the same subset of the 2004 data, but the 5 parti-
tions may not be the same) for the ACE 2004 
data. Both corpora are parsed using Charniak’s 
parser (Charniak, 2001). We iterate over all pairs 
of entity mentions occurring in the same sen-
tence to generate potential relation instances. In 
this paper, we only measure the performance of 
relation extraction models on “true” mentions 
with “true” chaining of coreference (i.e. as anno-
tated by LDC annotators). 
 
Implementation: We formalize relation extrac-
tion as a multi-class classification problem. SVM 
is selected as our classifier. We adopt the one vs. 
others strategy and select the one with the largest 
margin as the final answer. The training parame-
ters are chosen using cross-validation (C=2.4 
(SVM); λ =0.4(tree kernel)). In our implementa-
tion, we use the binary SVMLight (Joachims, 
1998) and Tree Kernel Tools (Moschitti, 2004). 
Precision (P), Recall (R) and F-measure (F) are 
adopted to measure the performance. 
4.2 Experimental Results 
In this subsection, we report the experiments of 
different kernel setups for different purposes. 
 
(1) Tree Kernel only over Different Relation 
Instance Spaces: In order to better study the im-
pact of the syntactic structure information in a 
parse tree on relation extraction, we remove the 
entity-related information from parse trees by 
replacing the entity-related phrase types (“E1-
PER” and so on as shown in Fig. 1) with “NP”. 
Table 1 compares the performance of 5 tree ker-
nel setups on the ACE 2003 data using the tree 
structure information only. It shows that:  
 
• Overall the five different relation instance 
spaces are all somewhat effective for relation 
extraction. This suggests that structured syntactic 
information has good predication power for rela-
tion extraction and the structured syntactic in-
formation can be well captured by the tree kernel.  
• MCT performs much worse than the others. 
The reasons may be that MCT includes too 
much left and right context information, which 
may introduce many noisy features and cause 
over-fitting (high precision and very low recall 
as shown in Table 1). This suggests that only 
keeping the complete (not partial) production 
rules in MCT does harm performance. 
• PT achieves the best performance. This means 
that only keeping the portion of a parse tree en-
closed by the shortest path between entities can 
model relations better than all others. This may 
be due to that most significant information is 
with PT and including context information may 
introduce too much noise. Although context 
may include some useful information, it is still a 
problem to correctly utilize such useful informa-
tion in the tree kernel for relation extraction. 
• CPT performs a bit worse than PT. In some 
cases (e.g. in sentence “the merge of company A 
and company B….”, “merge” is a critical con-
text word), the context information is helpful. 
However, the effective scope of context is hard 
to determine given the complexity and variabil-
ity of natural languages. 
• The two flattened trees perform worse than the 
original trees. This suggests that the single non-
terminal nodes are useful for relation extraction.  
Evaluation on the ACE 2004 data also shows 
that PT achieves the best performance (72.5/56.7 
/63.6 in P/R/F). More evaluations with the entity 
type and order information incorporated into tree 
nodes (“E1-PER”, “E2-PER” and “E-GPE” as 
shown in Fig. 1) also show that PT performs best 
with 76.1/62.6/68.7 in P/R/F on the 2003 data 
and 74.1/62.4/67.7 in P/R/F on the 2004 data. 
 
Instance Spaces P(%) R(%) F 
Minimum Complete Tree 
(MCT) 
77.5 38.4 51.3 
Path-enclosed Tree (PT) 72.8 53.8 61.9 
Context-Sensitive PT(CPT) 75.9 48.6 59.2 
Flattened PT 72.7 51.7 60.4 
Flattened CPT 76.1 47.2 58.2 
 
Table 1. five different tree kernel setups on the 
ACE 2003 five major types using the parse 
tree structure information only (regardless of 
any entity-related information) 
829
PTs (with Tree Struc-
ture Information only) 
P(%) R(%) F 
Entity kernel only 
75.1 
(79.5) 
42.7 
(34.6)
54.4 
(48.2) 
Tree kernel only 
72.5 
(72.8) 
56.7 
(53.8)
63.6 
(61.9) 
Composite kernel 1 
(linear combination) 
73.5 
(76.3) 
67.0 
(63.0)
70.1 
(69.1) 
Composite kernel 2 
(polynomial expansion)
76.1 
(77.3) 
68.4 
(65.6)
72.1 
(70.9) 
 
Table 2. Performance comparison of different 
kernel setups over the ACE major types of 
both the 2003 data (the numbers in parenthe-
ses) and the 2004 data (the numbers outside 
parentheses) 
 
(2) Composite Kernels: Table 2 compares the 
performance of different kernel setups on the 
ACE major types. It clearly shows that:  
• The composite kernels achieve significant per-
formance improvement over the two individual 
kernels. This indicates that the flat and the struc-
tured features are complementary and the com-
posite kernels can well integrate them: 1) the 
flat entity information captured by the entity 
kernel; 2) the structured syntactic connection 
information between the two entities captured 
by the tree kernel. 
 
• The composite kernel via the polynomial ex-
pansion outperforms the one via the linear com-
bination by ~2 in F-measure. It suggests that the 
bi-gram entity features are very useful.  
 
• The entity features are quite useful, which can 
achieve F-measures of 54.4/48.2 alone and can 
boost the performance largely by ~7 (70.1-
63.2/69.1-61.9) in F-measure when combining 
with the tree kernel.  
 
• It is interesting that the ACE 2004 data shows 
consistent better performance on all setups than 
the 2003 data although the ACE 2003 data is 
two times larger than the ACE 2004 data. This 
may be due to two reasons: 1) The ACE 2004 
data defines two new entity types and re-defines 
the relation types and subtypes in order to re-
duce the inconsistency between LDC annota-
tors. 2) More importantly, the ACE 2004 data 
defines 43 entity subtypes while there are only 3 
subtypes in the 2003 data. The detailed classifi-
cation in the 2004 data leads to significant per-
formance improvement of 6.2 (54.4-48.2) in F-
measure over that on the 2003 data. 
Our composite kernel can achieve 
77.3/65.6/70.9 and 76.1/68.4/72.1 in P/R/F over 
the ACE 2003/2004 major types, respectively. 
Methods (2002/2003 data) P(%) R(%) F 
Ours: composite kernel 2 
(polynomial expansion) 
77.3 
(64.9) 
65.6 
(51.2) 
70.9 
(57.2) 
Zhou et al. (2005):  
feature-based SVM 
77.2 
(63.1) 
60.7 
(49.5) 
68.0 
(55.5) 
Kambhatla (2004):  
feature-based ME 
 (-) 
(63.5) 
 (-) 
(45.2) 
 (-) 
(52.8) 
Ours: tree kernel with en-
tity information at node 
76.1 
(62.4) 
62.6 
(48.5) 
68.7 
(54.6) 
Bunescu and Mooney 
(2005): shortest path de-
pendency kernel 
65.5 
(-) 
43.8 
(-) 
52.5 
(-) 
Culotta and Sorensen 
(2004): dependency kernel 
67.1 
(-) 
35.0 
(-) 
45.8 
(-) 
 
Table 3. Performance comparison on the ACE 
2003/2003 data over both 5 major types (the 
numbers outside parentheses) and 24 subtypes 
(the numbers in parentheses)  
 
Methods (2004 data) P(%) R(%) F 
Ours: composite kernel 2 
(polynomial expansion) 
76.1 
 (68.6) 
68.4 
(59.3)
72.1 
 (63.6)
Zhao and Grishman (2005): 
feature-based kernel 
69.2 
(-) 
70.5 
(-) 
70.4 
(-) 
 
Table 4. Performance comparison on the ACE 
2004 data over both 7 major types (the numbers 
outside parentheses) and 23 subtypes (the num-
bers in parentheses) 
 
(3) Performance Comparison: Tables 3 and 4 
compare our method with previous work on the 
ACE 2002/2003/2004 data, respectively. They 
show that our method outperforms the previous 
methods and significantly outperforms the previ-
ous two dependency kernels
4
. This may be due to 
two reasons: 1) the dependency tree (Culotta and 
Sorensen, 2004) and the shortest path (Bunescu 
and Mooney, 2005) lack the internal hierarchical 
phrase structure information, so their correspond-
ing kernels can only carry out node-matching 
directly over the nodes with word tokens; 2) the 
parse tree kernel has less constraints. That is, it is 
                                                 
4
 Bunescu and Mooney (2005) used the ACE 2002 corpus, 
including 422 documents, which is known to have many 
inconsistencies than the 2003 version. Culotta and Sorensen 
(2004) used a generic ACE corpus including about 800 
documents (no corpus version is specified). Since the testing 
corpora are in different sizes and versions, strictly speaking, 
it is not ready to compare these methods exactly and fairly. 
Therefore Table 3 is only for reference purpose. We just 
hope that we can get a few clues from this table. 
830
not restricted by the two constraints of the two 
dependency kernels (identical layer and ances-
tors for the matchable nodes and identical length 
of two shortest paths, as discussed in Section 2).  
 
The above experiments verify the effective-
ness of our composite kernels for relation extrac-
tion. They suggest that the parse tree kernel can 
effectively explore the syntactic features which 
are critical for relation extraction.  
 
# of error instances Error Type 
  2004 data 2003 data 
False Negative 
198  416 
False Positive 115 171 
Cross Type 62 96 
 
Table 5. Error distribution of major types on 
both the 2003 and 2004 data for the compos-
ite kernel by polynomial expansion 
 
(4) Error Analysis: Table 5 reports the error 
distribution of the polynomial composite kernel 
over the major types on the ACE data. It shows 
that 83.5%(198+115/198+115+62) / 85.8%(416 
+171/416+171+96) of the errors result from rela-
tion detection and only 16.5%/14.2% of the er-
rors result from relation characterization. This 
may be due to data imbalance and sparseness 
issues since we find that the negative samples are 
8 times more than the positive samples in the 
training set. Nevertheless, it clearly directs our 
future work. 
5 Discussion 
In this section, we compare our method with the 
previous work from the feature engineering 
viewpoint and report some other observations 
and issues in our experiments. 
5.1 Comparison with Previous Work 
This is to explain more about why our method 
performs better and significantly outperforms the 
previous two dependency tree kernels from the 
theoretical viewpoint. 
(1) Compared with Feature-based Methods: 
The basic difference lies in the relation instance 
representation (parse tree vs. feature vector) and 
the similarity calculation mechanism (kernel 
function vs. dot-product). The main difference is 
the different feature spaces. Regarding the parse 
tree features, our method implicitly represents a 
parse tree by a vector of integer counts of each 
sub-tree type, i.e., we consider the entire sub-tree 
types and their occurring frequencies. In this way, 
the parse tree-related features (the path features 
and the chunking features) used in the feature-
based methods are embedded (as a subset) in our 
feature space. Moreover, the in-between word 
features and the entity-related features used in 
the feature-based methods are also captured by 
the tree kernel and the entity kernel, respectively. 
Therefore our method has the potential of effec-
tively capturing not only most of the previous 
flat features but also the useful syntactic struc-
ture features. 
 
(2) Compared with Previous Kernels: Since 
our method only counts the occurrence of each 
sub-tree without considering the layer and the 
ancestors of the root node of the sub-tree, our 
method is not limited by the constraints (identi-
cal layer and ancestors for the matchable nodes, 
as discussed in Section 2) in Culotta and Soren-
sen (2004). Moreover, the difference between 
our method and Bunescu and Mooney (2005) is 
that their kernel is defined on the shortest path 
between two entities instead of the entire sub-
trees. However, the path does not maintain the 
tree structure information. In addition, their ker-
nel requires the two paths to have the same 
length. Such constraint is too strict. 
5.2 Other Issues 
(1) Speed Issue: The recursively-defined convo-
lution kernel is much slower compared to fea-
ture-based classifiers. In this paper, the speed 
issue is solved in three ways. First, the inclusion 
of the entity kernel makes the composite kernel 
converge fast. Furthermore, we find that the 
small portion (PT) of a full parse tree can effec-
tively represent a relation instance. This signifi-
cantly improves the speed. Finally, the parse tree 
kernel requires exact match between two sub-
trees, which normally does not occur very fre-
quently. Collins and Duffy (2001) report that in 
practice, running time for the parse tree kernel is 
more close to linear (O(|N
1
|+|N
2
|), rather than 
O(|N
1
|*|N
2
| ). As a result, using the PC with Intel 
P4 3.0G CPU and 2G RAM, our system only 
takes about 110 minutes and 30 minutes to do 
training on the ACE 2003 (~77k training in-
stances) and 2004 (~33k training instances) data, 
respectively.  
(2) Further Improvement: One of the potential 
problems in the parse tree kernel is that it carries 
out exact matches between sub-trees, so that this 
kernel fails to handle sparse phrases (i.e. “a car” 
vs. “a red car”) and near-synonymic grammar 
tags (for example, the variations of a verb (i.e. 
go, went, gone)). To some degree, it could possi-
bly lead to over-fitting and compromise the per-
831
formance. However, the above issues can be 
handled by allowing grammar-driven partial rule 
matching and other approximate matching 
mechanisms in the parse tree kernel calculation. 
Finally, it is worth noting that by introducing 
more individual kernels our method can easily 
scale to cover more features from a multitude of 
sources (e.g. Wordnet, gazetteers, etc) that can 
be brought to bear on the task of relation extrac-
tion. In addition, we can also easily implement 
the feature weighting scheme by adjusting the 
eqn.(2) and the rule (2) in calculating 
12
(, )nn∆  
(see subsection 3.1). 
6 Conclusion and Future Work 
Kernel functions have nice properties. In this 
paper, we have designed a composite kernel for 
relation extraction. Benefiting from the nice 
properties of the kernel methods, the composite 
kernel could well explore and combine the flat 
entity features and the structured syntactic fea-
tures, and therefore outperforms previous best-
reported feature-based methods on the ACE cor-
pus. To our knowledge, this is the first research 
to demonstrate that, without the need for exten-
sive feature engineering, an individual tree ker-
nel achieves comparable performance with the 
feature-based methods. This shows that the syn-
tactic features embedded in a parse tree are par-
ticularly useful for relation extraction and which 
can be well captured by the parse tree kernel. In 
addition, we find that the relation instance repre-
sentation (selecting effective portions of parse 
trees for kernel calculations) is very important 
for relation extraction. 
The most immediate extension of our work is 
to improve the accuracy of relation detection. 
This can be done by capturing more features by 
including more individual kernels, such as the 
WordNet-based semantic kernel (Basili et al., 
2005) and other feature-based kernels. We can 
also benefit from machine learning algorithms to 
study how to solve the data imbalance and 
sparseness issues from the learning algorithm 
viewpoint. In the future work, we will design a 
more flexible tree kernel for more accurate simi-
larity measure.  
 
Acknowledgements: We would like to thank 
Dr. Alessandro Moschitti for his great help in 
using his Tree Kernel Toolkits and fine-tuning 
the system. We also would like to thank the three 
anonymous reviewers for their invaluable sug-
gestions. 
References 
ACE. 2002-2005. The Automatic Content Extraction 
Projects. http://www.ldc.upenn.edu/Projects /ACE/ 
Basili R., Cammisa M. and Moschitti A. 2005. A Se-
mantic Kernel to classify text with very few train-
ing examples. ICML-2005 
Bunescu R. C. and Mooney R. J. 2005. A Shortest 
Path Dependency Kernel for Relation Extraction. 
EMNLP-2005 
Charniak E. 2001. Immediate-head Parsing for Lan-
guage Models. ACL-2001 
Collins M. and Duffy N. 2001. Convolution Kernels 
for Natural Language. NIPS-2001 
Culotta A. and Sorensen J. 2004. Dependency Tree 
Kernel for Relation Extraction. ACL-2004 
Haussler D. 1999. Convolution Kernels on Discrete 
Structures. Technical Report UCS-CRL-99-10, 
University of California, Santa Cruz. 
Joachims T. 1998. Text Categorization with Support 
Vecor Machine: learning with many relevant fea-
tures. ECML-1998 
Kambhatla N. 2004. Combining lexical, syntactic and 
semantic features with Maximum Entropy models 
for extracting relations. ACL-2004 (poster) 
Lodhi H., Saunders C., Shawe-Taylor J., Cristianini 
N. and Watkins C. 2002. Text classification using 
string kernel. Journal of Machine Learning Re-
search, 2002(2):419-444 
Miller S., Fox H., Ramshaw L. and Weischedel R. 
2000. A novel use of statistical parsing to extract 
information from text. NAACL-2000 
Moschitti A. 2004. A Study on Convolution Kernels 
for Shallow Semantic Parsing. ACL-2004 
MUC. 1987-1998. http://www.itl.nist.gov/iaui/894.02/ 
related_projects/muc/ 
Schölkopf B. and Smola A. J. 2001. Learning with 
Kernels: SVM, Regularization, Optimization and 
Beyond. MIT Press, Cambridge, MA 407-423 
Suzuki J., Hirao T., Sasaki Y. and Maeda E. 2003. 
Hierarchical Directed Acyclic Graph Kernel: 
Methods for Structured Natural Language Data. 
ACL-2003 
Zelenko D., Aone C. and Richardella A. 2003. Kernel 
Methods for Relation Extraction. Journal of Ma-
chine Learning Research. 2003(2):1083-1106 
Zhao S.B. and Grishman R. 2005. Extracting Rela-
tions with Integrated Information Using Kernel 
Methods. ACL-2005 
Zhou G.D., Su J, Zhang J. and Zhang M. 2005. Ex-
ploring Various Knowledge in Relation Extraction. 
ACL-2005 
832
