Modeling Commonality among Related Classes in Relation Extraction 
 
Zhou GuoDong      Su Jian      Zhang Min 
Institute for Infocomm Research 
21 Heng Mui Keng Terrace, Singapore 119613 
Email: {zhougd, sujian, mzhang}@i2r.a-star.edu.sg 
 
Abstract 
This paper proposes a novel hierarchical learn-
ing strategy to deal with the data sparseness 
problem in relation extraction by modeling the 
commonality among related classes. For each 
class in the hierarchy either manually prede-
fined or automatically clustered, a linear dis-
criminative function is determined in a top-
down way using a perceptron algorithm with 
the lower-level weight vector derived from the 
upper-level weight vector. As the upper-level 
class normally has much more positive train-
ing examples than the lower-level class, the 
corresponding linear discriminative function 
can be determined more reliably. The upper-
level discriminative function then can effec-
tively guide the discriminative function learn-
ing in the lower-level, which otherwise might 
suffer from limited training data. Evaluation 
on the ACE RDC 2003 corpus shows that the 
hierarchical strategy significantly improves the performance, by 5.6 and 5.1 in F-measure on the least frequent and medium frequent relations respectively. It also shows that our system outper-
forms the previous best-reported system by 2.7 
in F-measure on the 24 subtypes using the 
same feature set. 
1 Introduction 
With the dramatic increase in the amount of tex-
tual information available in digital archives and 
the WWW, there has been growing interest in 
techniques for automatically extracting informa-
tion from text. Information Extraction (IE) is one such technology: IE systems are expected to identify relevant information (usually of predefined types) from text documents in a certain domain and put it into a structured format.
According to the scope of the ACE program 
(ACE 2000-2005), current research in IE has 
three main objectives: Entity Detection and 
Tracking (EDT), Relation Detection and 
Characterization (RDC), and Event Detection 
and Characterization (EDC). This paper will 
focus on the ACE RDC task, which detects and 
classifies various semantic relations between two 
entities. For example, we want to determine 
whether a person is at a location, based on the 
evidence in the context. Extraction of semantic 
relationships between entities can be very useful 
for applications such as question answering, e.g. 
to answer the query “Who is the president of the 
United States?”.  
One major challenge in relation extraction is 
due to the data sparseness problem (Zhou et al 
2005). As the largest annotated corpus in relation 
extraction, the ACE RDC 2003 corpus shows 
that different types/subtypes of relations are very unevenly distributed and that a few relation subtypes, such as the subtype “Founder” under the type “ROLE”, suffer from a small amount of annotated data. Further experimentation in this paper (see Figure 2) shows that most relation subtypes suffer from a lack of training data and fail to achieve steady performance given the current corpus size. Given the relatively large size of this corpus, it would be time-consuming and very expensive to further expand the corpus for a reasonable gain in performance. Even if we could somehow expand the corpus and achieve steady performance on the major relation subtypes, it would still be impractical for the minor subtypes given the very uneven distribution among different relation subtypes. While various
machine learning approaches, such as generative 
modeling (Miller et al 2000), maximum entropy 
(Kambhatla 2004) and support vector machines 
(Zhao and Grishman 2005; Zhou et al 2005), have
been applied in the relation extraction task, no 
explicit learning strategy has been proposed to deal with the inherent data sparseness problem caused by the very uneven distribution among different relations.
This paper proposes a novel hierarchical 
learning strategy to deal with the data sparseness 
problem by modeling the commonality among 
related classes. Through organizing various 
classes hierarchically, a linear discriminative 
function is determined for each class in a top-
down way using a perceptron algorithm with the 
lower-level weight vector derived from the up-
per-level weight vector. Evaluation on the ACE 
RDC 2003 corpus shows that the hierarchical 
strategy achieves much better performance than 
the flat strategy on least- and medium-frequent 
relations. It also shows that our system based on 
the hierarchical strategy outperforms the previ-
ous best-reported system. 
The rest of this paper is organized as follows. 
Section 2 presents related work. Section 3 
describes the hierarchical learning strategy using 
the perceptron algorithm. Finally, we present 
experimentation in Section 4 and conclude this 
paper in Section 5.  
2 Related Work 
The relation extraction task was formulated at 
MUC-7 (1998). With the increasing popularity of
ACE, this task is starting to attract more and 
more researchers within the natural language 
processing and machine learning communities. 
Typical works include Miller et al (2000), Ze-
lenko et al (2003), Culotta and Sorensen (2004), 
Bunescu and Mooney (2005a), Bunescu and 
Mooney  (2005b), Zhang et al (2005), Roth and 
Yih (2002), Kambhatla (2004), Zhao and Grishman (2005) and Zhou et al (2005).
Miller et al (2000) augmented syntactic full 
parse trees with semantic information of entities 
and relations, and built generative models to in-
tegrate various tasks such as POS tagging, named 
entity recognition, template element extraction 
and relation extraction. The problem is that such integration may impose big challenges, e.g. the need for a large annotated corpus. To overcome
the data sparseness problem, generative models 
typically applied some smoothing techniques to 
integrate different scales of contexts in parameter 
estimation, e.g. the back-off approach in Miller 
et al (2000).  
Zelenko et al (2003) proposed extracting re-
lations by computing kernel functions between 
parse trees. Culotta and Sorensen (2004) extended 
this work to estimate kernel functions between 
augmented dependency trees and achieved an F-measure of 45.8 on the 5 relation types in the ACE RDC 2003 corpus, which defines 5/24 relation types/subtypes between 4 entity types. Bunescu and Mooney (2005a) proposed a shortest path dependency kernel, arguing that the information needed to model a relationship between two entities can typically be captured by the shortest path between them in the dependency graph. This kernel achieved an F-measure of 52.5 on the 5 relation types in the ACE RDC 2003 corpus. Bunescu and Mooney (2005b) proposed a subsequence kernel and applied it in protein interaction and ACE relation
extraction tasks. Zhang et al (2005) adopted clus-
tering algorithms in unsupervised relation extrac-
tion using tree kernels. To overcome the data 
sparseness problem, various scales of sub-trees 
are applied in the tree kernel computation. Al-
though tree kernel-based approaches are able to 
explore the huge implicit feature space without 
much feature engineering, further research work 
is necessary to make them effective and efficient. 
In comparison, feature-based approaches have achieved much success recently. Roth and Yih
(2002) used the SNoW classifier to incorporate 
various features such as word, part-of-speech and 
semantic information from WordNet, and pro-
posed a probabilistic reasoning approach to inte-
grate named entity recognition and relation 
extraction. Kambhatla (2004) employed maxi-
mum entropy models with features derived from 
word, entity type, mention level, overlap, de-
pendency tree and parse tree, and achieved an F-measure of 52.8 on the 24 relation subtypes in the ACE RDC 2003 corpus. Zhao and Grishman (2005) combined various kinds of knowledge from tokenization, sentence parsing and deep dependency analysis through support vector machines and achieved an F-measure of 70.1 on the 7 relation types of the ACE RDC 2004 corpus, which defines 7/27 relation types/subtypes between 7 entity types. (We classify Zhao and Grishman (2005) as a feature-based approach since the feature space in its kernels can be easily represented by an explicit feature vector.)
Zhou et al (2005) further systematically explored 
diverse lexical, syntactic and semantic features 
through support vector machines and achieved F-measures of 68.1 and 55.5 on the 5 relation types
and the 24 relation subtypes in the ACE RDC 
2003 corpus respectively. To overcome the data 
sparseness problem, feature-based approaches 
normally incorporate various scales of contexts 
into the feature vector extensively. These ap-
proaches then depend on adopted learning algo-
rithms to weight and combine each feature 
effectively. For example, an exponential model 
and a linear model are applied in the maximum 
entropy models and support vector machines re-
spectively to combine each feature via the 
learned weight vector. 
In summary, although various approaches 
have been employed in relation extraction, they 
implicitly attack the data sparseness problem by 
using features of different contexts in feature-
based approaches or including different sub-structures in kernel-based approaches. Until now, there has been no explicit way to capture the hierarchical topology in relation extraction. All current approaches apply the flat learning strategy, which treats training examples in different relations equally and independently, and ignores the commonality among different relations. This
paper proposes a novel hierarchical learning 
strategy to resolve this problem by considering 
the relatedness among different relations and 
capturing the commonality among related rela-
tions. By doing so, the data sparseness problem 
can be well dealt with and much better perform-
ance can be achieved, especially for those rela-
tions with small amounts of annotated examples.  
3 Hierarchical Learning Strategy 
Traditional classifier learning approaches apply 
the flat learning strategy. That is, they equally 
treat training examples in different classes 
independently and ignore the commonality 
among related classes. The flat strategy does not cause any problem when there is a large number of training examples for each class, since, in this case, a classifier learning approach can always learn a nearly optimal discriminative function for each class against the remaining classes. However, such a flat strategy may cause serious problems when there is only a small number of training examples for some of the classes. In this case, a classifier learning approach may fail to learn a reliable (or nearly optimal) discriminative function for a class with a small number of training examples and, as a result, may significantly degrade the performance on that class or even the overall performance.
To overcome the inherent problems in the 
flat strategy, this paper proposes a hierarchical 
learning strategy which explores the inherent 
commonality among related classes through a 
class hierarchy. In this way, the training exam-
ples of related classes can help in learning a reli-
able discriminative function for a class with only 
a small amount of training examples. To reduce 
computation time and memory requirements, we 
will only consider linear classifiers and apply the 
simple and widely-used perceptron algorithm for this purpose, leaving other options open for future research. In the following, we will first introduce
the perceptron algorithm in linear classifier 
learning, followed by the hierarchical learning 
strategy using the perceptron algorithm. Finally, 
we will consider several ways of building the
class hierarchy. 
3.1 Perceptron Algorithm 
_______________________________________
Input: the initial weight vector w, the training example sequence (x_t, y_t) ∈ X × Y, t = 1, 2, ..., T, and the maximal number N of iterations over the training sequence (e.g. 10 in this paper)
Output: the weight vector w for the linear discriminative function f = w · x
BEGIN
    w_1 = w
    REPEAT for t = 1, 2, ..., T*N
        1. Receive the instance x_t ∈ R^n
        2. Compute the output o_t = w_t · x_t
        3. Give the prediction ŷ_t = sign(o_t)
        4. Receive the desired label y_t ∈ {-1, +1}
        5. Update the hypothesis according to
               w_{t+1} = w_t + δ_t y_t x_t                  (1)
           where δ_t = 0 if the margin of w_t at the given example (x_t, y_t), i.e. y_t (w_t · x_t), is positive, and δ_t = 1 otherwise
    END REPEAT
    Return w = (1/5) Σ_{i=N-4..N} w_{i*T}, i.e. the average of the weight vectors at the ends of the last 5 passes
END
_______________________________________
Figure 1: the perceptron algorithm (Note: the training example sequence is fed N times for better performance. Moreover, N bounds the maximal effect a single training example can have, similar to the regularization parameter C in SVMs, which controls the trade-off between complexity and the proportion of non-separable examples; as a result, it can be used to control over-fitting and improve robustness.)
This section first deals with binary classification using linear classifiers. Assume an instance space X = R^n and a binary label space Y = {-1, +1}. With any weight vector w ∈ R^n and a given instance x ∈ R^n, we associate a linear classifier h_w with the linear discriminative function f(x) = w · x by h_w(x) = sign(w · x), where w · x denotes the dot product of the weight vector w and the instance x, sign(w · x) = -1 if w · x < 0 and sign(w · x) = +1 otherwise. Here, the margin of w at (x_t, y_t) is defined as y_t (w · x_t). If the margin is positive, we have a correct prediction with h_w(x_t) = y_t, and if the margin is negative, we have an error with h_w(x_t) ≠ y_t. Therefore, given a sequence of training examples (x_t, y_t) ∈ X × Y, t = 1, 2, ..., T, linear classifier learning attempts to find a weight vector w that achieves a positive margin on as many examples as possible.
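As a minimal illustration of these definitions, the following Python/NumPy sketch (illustrative only; the function names predict and margin are hypothetical, not from the paper) computes the prediction and the margin:
_______________________________________
import numpy as np

def predict(w, x):
    # h_w(x) = sign(w . x), with the sign taken as +1 when w . x >= 0,
    # matching the definition above
    return 1 if np.dot(w, x) >= 0 else -1

def margin(w, x, y):
    # margin of w at (x, y): positive means a correct prediction,
    # negative means an error
    return y * np.dot(w, x)
_______________________________________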
                                                           
The well-known perceptron algorithm, as shown in Figure 1, belongs to online learning of linear classifiers, where the learning algorithm represents its t-th hypothesis by a weight vector w_t ∈ R^n. At trial t, an online algorithm receives an instance x_t ∈ R^n, makes its prediction ŷ_t = sign(w_t · x_t) and receives the desired label y_t ∈ {-1, +1}. What distinguishes different online algorithms is how they update w_t into w_{t+1} based on the example (x_t, y_t) received at trial t. In particular, the perceptron algorithm updates the hypothesis by adding a scalar multiple of the instance, as shown in Equation 1 of Figure 1, when there is an error. Normally, the traditional perceptron algorithm initializes the hypothesis as the zero vector w_1 = 0. This is usually the most natural choice, lacking any other preference.
Smoothing 
In order to further improve the performance, we iteratively feed the training examples to obtain a possibly better discriminative function. In this paper, we set the maximal iteration number to 10 for both efficiency and stable performance, and the final weight vector of the discriminative function is averaged over those of the discriminative functions in the last few iterations (e.g. the last 5 in this paper).
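The training procedure of Figure 1, together with this smoothing step, can be sketched in Python/NumPy roughly as follows. This is a simplified illustrative sketch rather than the actual implementation used in this paper; the function name train_perceptron and the array-based interface are hypothetical.
_______________________________________
import numpy as np

def train_perceptron(X, y, w_init=None, n_iter=10, n_avg=5):
    # X: (T, n) array of instances; y: (T,) array of labels in {-1, +1}.
    # w_init: initial weight vector (the zero vector if None).
    T, n = X.shape
    w = np.zeros(n) if w_init is None else np.array(w_init, dtype=float)
    end_of_pass_weights = []
    for _ in range(n_iter):              # feed the training sequence N times
        for t in range(T):
            o = np.dot(w, X[t])          # o_t = w_t . x_t
            if y[t] * o <= 0:            # non-positive margin -> error
                w = w + y[t] * X[t]      # w_{t+1} = w_t + y_t x_t (Equation 1)
        end_of_pass_weights.append(w.copy())
    # smoothing: average the weight vectors of the last n_avg passes
    return np.mean(end_of_pass_weights[-n_avg:], axis=0)
_______________________________________
Passing a non-zero w_init is exactly what the hierarchical strategy of Section 3.2 exploits.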
Bagging 
One more problem with any online classifier 
learning algorithm, including the perceptron al-
gorithm, is that the learned discriminative func-
tion somewhat depends on the feeding order of 
the training examples. In order to eliminate such 
dependence and further improve the perform-
ance, an ensemble technique, called bagging 
(Breiman 1996), is applied in this paper. In bag-
ging, the bootstrap technique is first used to build 
M (e.g. 10 in this paper) replicate sample sets by 
randomly re-sampling with replacement from the 
given training set repeatedly. Then, each training 
sample set is used to train a certain discrimina-
tive function. Finally, the final weight vector in 
the discriminative function is averaged over 
those of the M discriminative functions in the 
ensemble. 
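This bagging step could be sketched as follows, reusing the hypothetical train_perceptron helper above (the bootstrap details here are illustrative choices, not necessarily those of the original system):
_______________________________________
import numpy as np

def bag_perceptrons(X, y, n_bags=10, seed=None, **train_kwargs):
    # Average the weight vectors of perceptrons trained on n_bags bootstrap
    # replicates of the training set, reducing the dependence on the
    # feeding order of the examples.
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    weights = []
    for _ in range(n_bags):
        idx = rng.integers(0, T, size=T)   # re-sample with replacement
        weights.append(train_perceptron(X[idx], y[idx], **train_kwargs))
    return np.mean(weights, axis=0)
_______________________________________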
Multi-Class Classification 
Basically, the perceptron algorithm is only for 
binary classification. Therefore, we must extend 
the perceptron algorithm to multi-class
classification, such as the ACE RDC task. For 
efficiency, we apply the one vs. others strategy, 
which builds K classifiers so as to separate one 
class from all others. However, the outputs of the perceptron algorithms for different classes may not be directly comparable, since any positive scalar multiple of the weight vector does not affect the actual prediction of a perceptron algorithm. For comparability, we map the perceptron algorithm output into a probability by using an additional sigmoid model:

    p(y = 1 | f) = 1 / (1 + exp(A·f + B))          (2)

where f = w · x is the output of a perceptron algorithm and the coefficients A and B are trained using the model-trust algorithm as
described in Platt (1999). The final decision of an 
instance in multi-class classification is 
determined by the class which has the maximal 
probability from the corresponding perceptron 
algorithm.  
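Putting the pieces together, the multi-class decision could look roughly like the sketch below, assuming the sigmoid coefficients A and B of Equation 2 have already been fitted per class with Platt's procedure; the data structure and function name are illustrative, not from the paper.
_______________________________________
import numpy as np

def predict_class(x, classifiers):
    # classifiers: dict mapping class label -> (w, A, B), where w is the
    # weight vector of the one-vs-others perceptron for that class and
    # A, B are its sigmoid coefficients (Equation 2).
    best_label, best_prob = None, -1.0
    for label, (w, A, B) in classifiers.items():
        f = np.dot(w, x)                       # perceptron output f = w . x
        p = 1.0 / (1.0 + np.exp(A * f + B))    # p(y = 1 | f), Equation 2
        if p > best_prob:
            best_label, best_prob = label, p
    return best_label
_______________________________________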
3.2 Hierarchical Learning Strategy using the 
Perceptron Algorithm 
Assume we have a class hierarchy for a task, e.g. 
the one in the ACE RDC 2003 corpus as shown 
in Table 1 of Section 4.1. The hierarchical learn-
ing strategy explores the inherent commonality 
among related classes in a top-down way. For 
each class in the hierarchy, a linear discrimina-
tive function is determined in a top-down way 
with the lower-level weight vector derived from 
the upper-level weight vector iteratively. This is 
done by initializing the weight vector in training 
the linear discriminative function for the lower-
level class as that of the upper-level class. That 
is, the lower-level discriminative function has the 
preference toward the discriminative function of 
its upper-level class. As an example, let’s look
at the training of the “Located” relation subtype 
in the class hierarchy as shown in Table 1: 
1) Train the weight vector of the linear 
discriminative function for the “YES” 
relation vs. the “NON” relation with the 
weight vector initialized as the zero vector. 
2) Train the weight vector of the linear 
discriminative function for the “AT” relation 
type vs. all the remaining relation types 
(including the “NON” relation) with the 
weight vector initialized as the weight vector 
of the linear discriminative function for the 
“YES” relation vs. the “NON” relation. 
3) Train the weight vector of the linear 
discriminative function for the “Located” 
relation subtype vs. all the remaining relation 
subtypes under all the relation types 
(including the “NON” relation) with the 
weight vector initialized as the weight vector 
of the linear discriminative function for the 
“AT” relation type vs. all the remaining 
relation types. 
4) Return the above trained weight vector as the 
discriminative function for the “Located”
relation subtype. 
In this way, the training examples in differ-
ent classes are not treated independently any 
more, and the commonality among related 
classes can be captured via the hierarchical learn-
ing strategy. The intuition behind this strategy is 
that the upper-level class normally has more 
positive training examples than the lower-level 
class so that the corresponding linear discrimina-
tive function can be determined more reliably. In 
this way, the training examples of related classes 
can help in learning a reliable discriminative 
function for a class with only a small amount of 
training examples in a top-down way and thus 
alleviate its data sparseness problem. 
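In code, the strategy simply amounts to passing the parent's weight vector as the initial weight vector of each child classifier. Below is a minimal recursive sketch, again assuming the hypothetical train_perceptron helper from Section 3.1; the tree representation is an illustrative choice, not taken from the paper.
_______________________________________
import numpy as np

def train_hierarchy(node, X, labels, parent_w=None):
    # Recursively train one one-vs-others perceptron per class node.
    # node: dict with keys 'name', 'members' (the set of leaf relation
    # subtypes covered by this class) and 'children' (list of child nodes).
    # Each node's weight vector is initialized with its parent's weight
    # vector, so that related classes share commonality.
    y = np.where(np.isin(labels, list(node['members'])), 1, -1)
    w = train_perceptron(X, y, w_init=parent_w)
    trained = {node['name']: w}
    for child in node.get('children', []):
        trained.update(train_hierarchy(child, X, labels, parent_w=w))
    return trained
_______________________________________
For the example above, the root (“YES”) is trained from the zero vector, “AT” is then initialized with the root's weight vector, and “Located” with that of “AT”, exactly as in steps 1)-4).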
3.3 Building the Class Hierarchy  
We have just described the hierarchical learning 
strategy using a given class hierarchy. Normally, 
a rough class hierarchy can be given manually 
according to human intuition, such as the one in 
the ACE RDC 2003 corpus. In order to explore 
more commonality among sibling classes, we 
make use of binary hierarchical clustering of sibling classes, either at the lowest level only or at all levels. This
can be done by first using the flat learning strat-
egy to learn the discriminative functions for indi-
vidual classes and then iteratively combining the 
two most related classes using the cosine similar-
ity function between their weight vectors in a 
bottom-up way. The intuition is that related 
classes should have similar hyper-planes to sepa-
rate from other classes and thus have similar 
weight vectors. 
• Lowest-level hybrid: Binary hierarchical 
clustering is only done at the lowest level 
while keeping the upper-level class hierar-
chy. That is, only sibling classes at the low-
est level are hierarchically clustered. 
• All-level hybrid: Binary hierarchical cluster-
ing is done at all levels in a bottom-up way. 
That is, sibling classes at the lowest level are 
hierarchically clustered first and then sibling 
classes at the upper-level. In this way, the bi-
nary class hierarchy can be built iteratively 
in a bottom-up way. 
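The clustering itself can be sketched as a plain bottom-up agglomerative procedure over the flat-strategy weight vectors. This is an illustration under stated assumptions: in particular, the paper does not specify how a merged cluster is represented, and averaging the member weight vectors is just one simple choice; in the hybrid settings the procedure would be restricted to the relevant groups of sibling classes.
_______________________________________
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_binary_hierarchy(weight_vectors):
    # weight_vectors: dict mapping class name -> flat-strategy weight vector.
    # Repeatedly merge the two most similar clusters (by cosine similarity
    # of their weight vectors) into a binary tree, bottom-up.
    clusters = [(name, w) for name, w in weight_vectors.items()]
    while len(clusters) > 1:
        # find the most similar pair of clusters
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cosine(clusters[ab[0]][1], clusters[ab[1]][1]),
        )
        (t1, w1), (t2, w2) = clusters[i], clusters[j]
        merged = ((t1, t2), (w1 + w2) / 2.0)       # new internal node
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]                          # nested tuples form the binary tree
_______________________________________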
 
 
4 Experimentation 
This paper uses the ACE RDC 2003 corpus pro-
vided by LDC to train and evaluate the hierarchi-
cal learning strategy. Same as Zhou et al (2005), 
we only model explicit relations and explicitly 
model the argument order of the two mentions 
involved.  
4.1 Experimental Setting 
Type Subtype Freq Bin Type 
AT Based-In 347 Medium 
 Located 2126 Large 
 Residence 308 Medium 
NEAR Relative-Location 201 Medium 
PART Part-Of 947 Large 
 Subsidiary 355 Medium 
 Other 6 Small
ROLE Affiliate-Partner 204 Medium 
 Citizen-Of 328 Medium
 Client 144 Small 
 Founder 26 Small
 General-Staff 1331 Large 
 Management 1242 Large 
 Member 1091 Large 
 Owner 232 Medium 
 Other 158 Small
SOCIAL Associate 91 Small 
 Grandparent 12 Small 
 Other-Personal 85 Small
 Other-Professional 339 Medium 
 Other-Relative 78 Small 
 Parent 127 Small 
 Sibling 18 Small
 Spouse 77 Small 
Table 1: Statistics of relation types and subtypes 
in the training data of the ACE RDC 2003 corpus 
(Note: According to frequency, all the subtypes 
are divided into three bins: large/ middle/ small, 
with 400 as the lower threshold for the large bin 
and 200 as the upper threshold for the small bin). 
The training data consists of 674 documents 
(~300k words) with 9683 relation examples 
while the held-out testing data consists of 97 
documents (~50k words) with 1386 relation ex-
amples. All the experiments are done five times 
on the 24 relation subtypes in the ACE corpus, 
unless otherwise specified, with the final per-
formance averaged using the same re-sampling 
with replacement strategy as the one in the bag-
ging technique. Table 1 lists various types and 
subtypes of relations for the ACE RDC 2003 
corpus, along with their occurrence frequency in 
the training data. It shows that this corpus suffers 
from a small amount of annotated data for a few 
subtypes such as the subtype “Founder” under 
the type “ROLE”. 
For comparison, we also adopt the same fea-
ture set as Zhou et al (2005): word, entity type, 
mention level, overlap, base phrase chunking, 
dependency tree, parse tree and semantic infor-
mation. 
4.2 Experimental Results 
Table 2 shows the performance of the hierarchi-
cal learning strategy using the existing class hier-
archy in the given ACE corpus and its 
comparison with the flat learning strategy, using 
the perceptron algorithm. It shows that the pure 
hierarchical strategy outperforms the pure flat 
strategy by 1.5 (56.9 vs. 55.4) in F-measure. It 
also shows that further smoothing and bagging 
improve the performance of the hierarchical and 
flat strategies by 0.6 and 0.9 in F-measure re-
spectively. As a result, the final hierarchical 
strategy achieves F-measure of 57.8 and outper-
forms the final flat strategy by 1.8 in F-measure. 
Strategies  P R F 
Flat 58.2 52.8 55.4 
Flat+Smoothing 58.9 53.1 55.9 
Flat+Bagging 59.0 53.1 55.9 
Flat+Both 59.1 53.2 56.0 
Hierarchical 61.9 52.6 56.9 
Hierarchical+Smoothing 62.7 53.1 57.5 
Hierarchical+Bagging 62.9 53.1 57.6 
Hierarchical+Both 63.0 53.4 57.8 
Table 2: Performance of the hierarchical learning 
strategy using the existing class hierarchy and its 
comparison with the flat learning strategy 
Class Hierarchies P R F 
Existing 63.0 53.4 57.8 
Entirely Automatic 63.4 53.1 57.8 
Lowest-level Hybrid 63.6 53.5 58.1 
All-level Hybrid 63.6 53.6 58.2 
Table 3: Performance of the hierarchical learning 
strategy using different class hierarchies 
Table 3 compares the performance of the hi-
erarchical learning strategy using different class 
hierarchies. It shows that the lowest-level hybrid approach, which only automatically updates the existing class hierarchy at the lowest level, improves the performance by 0.3 in F-measure, while further updating the class hierarchy at upper levels in the all-level hybrid approach has only a very slight effect. This is largely due to the fact that the major data sparseness problem occurs at the lowest level, i.e. the relation subtype
level in the ACE corpus. As a result, the final hierarchical learning strategy using the class hierarchy built with the all-level hybrid approach achieves an F-measure of 58.2, which outperforms the final flat strategy by 2.2 in F-measure. In order to justify the usefulness of our hierarchical learning strategy when a rough class hierarchy is not available or is difficult to determine manually, we also experiment with an entirely automatically built class hierarchy (using the traditional binary hierarchical clustering algorithm and the cosine similarity measure) without considering the existing class hierarchy. Table 3 shows that using the automatically built class hierarchy performs comparably with using the existing one.
With the major goal of resolving the data 
sparseness problem for the classes with a small 
amount of training examples, Table 4 compares 
the best-performed hierarchical and flat learning 
strategies on the relation subtypes of different   
training data sizes. Here, we divide various rela-
tion subtypes into three bins: large/middle/small, 
according to their available training data sizes. 
For the ACE RDC 2003 corpus, we use 400 as the lower threshold for the large bin (no relation subtype in the corpus has between 400 and 900 training examples) and 200 as the upper threshold for the small bin (a few minor relation subtypes have very few examples in the testing set; an upper threshold of 200 fills the small bin with about 100 testing examples, whereas 100 would leave too few testing examples for a reasonable performance evaluation). As a result, the large/medium/small bins include 5/8/11 relation subtypes, respectively. Please see Table
1 for details. Table 4 shows that the hierarchical 
strategy outperforms the flat strategy by 
1.0/5.1/5.6 in F-measure on the 
large/middle/small bins respectively. This indicates that the hierarchical strategy performs much better than the flat strategy for those classes with a small or medium amount of annotated examples, although it performs only slightly better than the flat strategy (by 1.0 and 2.2 in F-measure) on those classes with a large amount of annotated examples and on all classes as a whole, respectively. This suggests that the proposed hierarchical strategy can effectively deal with the data sparseness problem in the ACE RDC 2003 corpus.
An interesting question is about the similar-
ity between the linear discriminative functions 
learned using the hierarchical and flat learning 
strategies.  Table 4 compares the cosine similari-
ties between the weight vectors of the linear dis-
criminative functions using the two strategies for 
different bins, weighted by the training data sizes of different relation subtypes. It shows that the linear discriminative functions learned using the two strategies are very similar (with a cosine similarity of 0.98) for the relation subtypes in the large bin, but much less similar for the relation subtypes in the medium/small bins (with cosine similarities of 0.92/0.81 respectively). This means that, compared with the flat strategy, the hierarchical strategy changes the linear discriminative functions only slightly for those classes with a large amount of annotated examples, while its effect on those with a small amount of annotated examples is obvious. This contributes to and explains (the degree of) the performance difference between the two strategies on the different training data sizes as shown in Table 4.
Due to the difficulty of building a large an-
notated corpus, another interesting question is 
about the learning curve of the hierarchical learn-
ing strategy and its comparison with the flat 
learning strategy. Figure 2 shows the effect of 
different training data sizes for some major rela-
tion subtypes while keeping all the training ex-
amples of remaining relation subtypes. It shows 
that the hierarchical strategy performs much bet-
ter than the flat strategy when only a small 
amount of training examples is available. It also 
shows that the hierarchical strategy can achieve 
stable performance much faster than the flat 
strategy. Finally, it shows that the ACE RDC 
2003 task suffers from the lack of training exam-
ples. Among the three major relation subtypes, 
only the subtype “Located” achieves steady per-
formance. 
Finally, we also compare our system with the 
previous best-reported systems, such as Kamb-
hatla  (2004) and Zhou et al (2005). Table 5 
shows that our system outperforms the previous 
best-reported system by 2.7 (58.2 vs. 55.5) in F-
measure, largely due to the gain in recall. It indi-
cates that, although support vector machines and maximum entropy models generally perform better than the simple perceptron algorithm in most (if not all) applications, the hierarchical learning strategy using the perceptron algorithm can easily overcome this difference and outperform the flat learning strategy using the dominant support vector machines and maximum entropy models in relation extraction, at least on the ACE RDC 2003 corpus.
Bin Type (cosine similarity)   Large Bin (0.98)      Middle Bin (0.92)     Small Bin (0.81)
                               P     R     F         P     R     F         P     R     F
Flat Strategy                  62.3  61.9  62.1      60.8  38.7  47.3      33.0  21.7  26.2
Hierarchical Strategy          66.4  60.2  63.1      67.6  42.7  52.4      40.2  26.3  31.8
Table 4: Comparison of the hierarchical and flat learning strategies on the relation subtypes of differ-
ent training data sizes. Notes: the figures in the parentheses indicate the cosine similarities between 
the weight vectors of the linear discriminative functions learned using the two strategies. 
[Figure 2 here: learning curves plotting F-measure (10-70) against training data size (200-2,000) for the hierarchical strategy (HS) and the flat strategy (FS) on the General-Staff, Part-Of and Located subtypes]
Figure 2: Learning curve of the hierarchical strategy and its comparison with the flat strategy for some major relation subtypes (Note: FS for the flat strategy and HS for the hierarchical strategy)
System                                                 P     R     F
Ours: Perceptron Algorithm + Hierarchical Strategy     63.6  53.6  58.2
Zhou et al (2005): SVM + Flat Strategy                 63.1  49.5  55.5
Kambhatla (2004): Maximum Entropy + Flat Strategy      63.5  45.2  52.8
Table 5: Comparison of our system with other best-reported systems 
5 Conclusion 
This paper proposes a novel hierarchical learning 
strategy to deal with the data sparseness problem 
in relation extraction by modeling the common-
ality among related classes. For each class in a 
class hierarchy, a linear discriminative function 
is determined in a top-down way using the per-
ceptron algorithm with the lower-level weight 
vector derived from the upper-level weight vec-
tor. In this way, the upper-level discriminative 
function can effectively guide the lower-level 
discriminative function learning. Evaluation on 
the ACE RDC 2003 corpus shows that the hier-
archical strategy performs much better than the 
flat strategy in resolving the critical data sparse-
ness problem in relation extraction. 
In future work, we will explore the hier-
archical learning strategy using other machine 
learning approaches besides online classifier 
learning approaches such as the simple percep-
tron algorithm applied in this paper. Moreover, 
just as indicated in Figure 2, most relation sub-
types in the ACE RDC 2003 corpus (arguably 
the largest annotated corpus in relation extrac-
tion) suffer from the lack of training examples. 
Therefore, a critical research in relation extrac-
tion is how to rely on semi-supervised learning 
approaches (e.g. bootstrap) to alleviate its de-
pendency on a large amount of annotated training 
examples and achieve better and steadier per-
formance. Finally, our current work is done when 
NER has been perfectly done. Therefore, it 
would be interesting to see how imperfect NER 
affects the performance in relation extraction. 
This will be done by integrating the relation ex-
traction system with our previously developed 
NER system as described in Zhou and Su (2002). 
References  
ACE. (2000-2005). Automatic Content Extraction. 
http://www.ldc.upenn.edu/Projects/ACE/  
Bunescu R. & Mooney R.J. (2005a). A shortest 
path dependency kernel for relation extraction. 
HLT/EMNLP’2005: 724-731. 6-8 Oct 2005. Vancouver, B.C.
Bunescu R. & Mooney R.J. (2005b). Subsequence Kernels for Relation Extraction. NIPS’2005. Vancouver, B.C., December 2005.
Breiman L. (1996) Bagging Predictors. Machine 
Learning, 24(2): 123-140. 
Collins M. (1999).  Head-driven statistical models 
for natural language parsing. Ph.D. Dissertation, 
University of Pennsylvania. 
Culotta A. and Sorensen J. (2004). Dependency 
tree kernels for relation extraction. ACL’2004. 
423-429. 21-26 July 2004. Barcelona, Spain. 
Kambhatla N. (2004). Combining lexical, syntactic 
and semantic features with Maximum Entropy 
models for extracting relations. 
ACL’2004(Poster). 178-181. 21-26 July 2004. 
Barcelona, Spain. 
Miller G.A. (1990). WordNet: An online lexical 
database. International Journal of Lexicography. 
3(4):235-312. 
Miller S., Fox H., Ramshaw L. and Weischedel R. 
(2000). A novel use of statistical parsing to ex-
tract information from text. ANLP’2000. 226-
233. 29 April  - 4 May 2000, Seattle, USA 
MUC-7. (1998). Proceedings of the 7th Message Understanding Conference (MUC-7). Morgan Kaufmann, San Mateo, CA.
Platt J. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers. Edited by Smola A.J., Bartlett P., Scholkopf B. and Schuurmans D. MIT Press.
Roth D. and Yih W.T. (2002). Probabilistic reasoning for entities and relation recognition. COLING’2002: 835-841. 26-30 Aug 2002. Taiwan.
Zelenko D., Aone C. and Richardella A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research. 3(Feb):1083-1106.
Zhang M., Su J., Wang D.M., Zhou G.D. and Tan 
C.L. (2005). Discovering Relations from a Large 
Raw Corpus Using Tree Similarity-based Clus-
tering, IJCNLP’2005, Lecture Notes in 
Computer Science (LNCS 3651). 378-389. 11-16 
Oct 2005. Jeju Island, South Korea. 
Zhao S.B. and Grishman R. (2005). Extracting relations with integrated information using kernel methods. ACL’2005: 419-426. University of Michigan, Ann Arbor, USA. 25-30 June 2005.
Zhou G.D. and Su J. (2002). Named Entity Recognition Using an HMM-based Chunk Tagger. ACL’2002: 473-480. Philadelphia, USA. July 2002.
Zhou G.D., Su J., Zhang J. and Zhang M. (2005). Exploring various knowledge in relation extraction. ACL’2005: 427-434. 25-30 June 2005. Ann Arbor, Michigan, USA.
