Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 724–731, Vancouver, October 2005. c©2005 Association for Computational Linguistics
A Shortest Path Dependency Kernel for Relation Extraction
Razvan C. Bunescu and Raymond J. Mooney
Department of Computer Sciences
University of Texas at Austin
1 University Station C0500
Austin, TX 78712
razvan,mooney@cs.utexas.edu
Abstract
We present a novel approach to relation
extraction, based on the observation that
the information required to assert a rela-
tionship between two named entities in
the same sentence is typically captured
by the shortest path between the two en-
tities in the dependency graph. Exper-
iments on extracting top-level relations
from the ACE (Automated Content Ex-
traction) newspaper corpus show that the
new shortest path dependency kernel out-
performs a recent approach based on de-
pendency tree kernels.
1 Introduction
One of the key tasks in natural language process-
ing is that of Information Extraction (IE), which is
traditionally divided into three subproblems: coref-
erence resolution, named entity recognition, and
relation extraction. Consequently, IE corpora are
typically annotated with information corresponding
to these subtasks (MUC (Grishman, 1995), ACE
(NIST, 2000)), facilitating the development of sys-
tems that target only one or a subset of the three
problems. In this paper we focus exclusively on ex-
tracting relations between predefined types of en-
tities in the ACE corpus. Reliably extracting re-
lations between entities in natural-language docu-
ments is still a difficult, unsolved problem, whose
inherent difficulty is compounded by the emergence
of new application domains, with new types of nar-
rative that challenge systems developed for previous
well-studied domains. The accuracy level of cur-
rent syntactic and semantic parsers on natural lan-
guage text from different domains limit the extent
to which syntactic and semantic information can be
used in real IE systems. Nevertheless, various lines
of work on relation extraction have shown experi-
mentally that the use of automatically derived syn-
tactic information can lead to significant improve-
ments in extraction accuracy. The amount of syntac-
tic knowledge used in IE systems varies from part-
of-speech only (Ray and Craven, 2001) to chunking
(Ray and Craven, 2001) to shallow parse trees (Ze-
lenko et al., 2003) to dependency trees derived from
full parse trees (Culotta and Sorensen, 2004). Even
though exhaustive experiments comparing the per-
formance of a relation extraction system based on
these four levels of syntactic information are yet to
be conducted, a reasonable assumption is that the ex-
traction accuracy increases with the amount of syn-
tactic information used. The performance however
depends not only on the amount of syntactic infor-
mation, but also on the details of the exact models
using this information. Training a machine learn-
ing system in a setting where the information used
for representing the examples is only partially rele-
vant to the actual task often leads to overfitting. It is
therefore important to design the IE system so that
the input data is stripped of unnecessary features as
much as possible. In the case of the tree kernels
from (Zelenko et al., 2003; Culotta and Sorensen,
2004), the authors reduce each relation example to
the smallest subtree in the parse or dependency tree
that includes both entities. We will show in this
paper that increased extraction performance can be
724
obtained by designing a kernel method that uses an
even smaller part of the dependency structure – the
shortest path between the two entities in the undi-
rected version of the dependency graph.
2 Dependency Graphs
Let e1 and e2 be two entities mentioned in the same
sentence such that they are observed to be in a re-
lationship R, i.e R(e1,e2) = 1. For example, R
can specify that entity e1 is LOCATED (AT) entity
e2. Figure 1 shows two sample sentences from ACE,
with entity mentions in bold. Correspondingly, the
first column in Table 1 lists the four relations of type
LOCATED that need to be extracted by the IE sys-
tem. We assume that a relation is to be extracted
only between entities mentioned in the same sen-
tence, and that the presence or absence of a relation
is independent of the text preceding or following the
sentence. This means that only information derived
from the sentence including the two entities will be
relevant for relation extraction. Furthermore, with
each sentence we associate its dependency graph,
with words figured as nodes and word-word depen-
dencies figured as directed edges, as shown in Fig-
ure 1. A subset of these word-word dependencies
capture the predicate-argument relations present in
the sentence. Arguments are connected to their tar-
get predicates either directly through an arc point-
ing to the predicate (’troops → raided’), or indirectly
through a preposition or infinitive particle (’warning
← to ← stop’). Other types of word-word dependen-
cies account for modifier-head relationships present
in adjective-noun compounds (’several → stations’),
noun-noun compounds (’pumping → stations’), or
adverb-verb constructions (’recently → raided’). In
Figure 1 we show the full dependency graphs for two
sentences from the ACE newspaper corpus.
Word-word dependencies are typically catego-
rized in two classes as follows:
• [Local Dependencies] These correspond to lo-
cal predicate-argument (or head-modifier) con-
structions such as ’troops → raided’, or ’pump-
ing → stations’ in Figure 1.
• [Non-local Dependencies] Long-distance de-
pendencies arise due to various linguistic con-
structions such as coordination, extraction,
raising and control. In Figure 1, among non-
local dependencies are ’troops → warning’, or
’ministers → preaching’.
A Context Free Grammar (CFG) parser can be
used to extract local dependencies, which for each
sentence form a dependency tree. Mildly context
sensitive formalisms such as Combinatory Catego-
rial Grammar (CCG) (Steedman, 2000) model word-
word dependencies more directly and can be used to
extract both local and long-distance dependencies,
giving rise to a directed acyclic graph, as illustrated
in Figure 1.
3 The Shortest Path Hypothesis
If e1 and e2 are two entities mentioned in the same
sentence such that they are observed to be in a rela-
tionship R, our hypothesis stipulates that the con-
tribution of the sentence dependency graph to es-
tablishing the relationship R(e1,e2) is almost exclu-
sively concentrated in the shortest path between e1
and e2 in the undirected version of the dependency
graph.
If entities e1 and e2 are arguments of the same
predicate, then the shortest path between them will
pass through the predicate, which may be con-
nected directly to the two entities, or indirectly
through prepositions. If e1 and e2 belong to different
predicate-argument structures that share a common
argument, then the shortest path will pass through
this argument. This is the case with the shortest path
between ’stations’ and ’workers’ in Figure 1, pass-
ing through ’protesters’, which is an argument com-
mon to both predicates ’holding’ and ’seized’. In
Table 1 we show the paths corresponding to the four
relation instances encoded in the ACE corpus for the
two sentences from Figure 1. All these paths sup-
port the LOCATED relationship. For the first path, it
is reasonable to infer that if a PERSON entity (e.g.
’protesters’) is doing some action (e.g. ’seized’) to
a FACILITY entity (e.g. ’station’), then the PERSON
entity is LOCATED at that FACILITY entity. The sec-
ond path captures the fact that the same PERSON
entity (e.g. ’protesters’) is doing two actions (e.g.
’holding’ and ’seized’) , one action to a PERSON en-
tity (e.g. ’workers’), and the other action to a FACIL-
ITY entity (e.g. ’station’). A reasonable inference in
this case is that the ’workers’ are LOCATED at the
725
S1 =
=S2
Protesters stations workers
Troops churches ministers
seized   several   pumping , holding   127   Shell hostage .
recently   have   raided , warning to   stop   preaching .
Figure 1: Sentences as dependency graphs.
Relation Instance Shortest Path in Undirected Dependency Graph
S1: protesters AT stations protesters −→ seized ←− stations
S1: workers AT stations workers −→ holding ←− protesters −→ seized ←− stations
S2: troops AT churches troops −→ raided ←− churches
S2: ministers AT churches ministers −→ warning ←− troops −→ raided ←− churches
Table 1: Shortest Path representation of relations.
’station’.
In Figure 2 we show three more examples of the
LOCATED (AT) relationship as dependency paths
created from one or two predicate-argument struc-
tures. The second example is an interesting case,
as it illustrates how annotation decisions are accom-
modated in our approach. Using a reasoning similar
with that from the previous paragraph, it is reason-
able to infer that ’troops’ are LOCATED in ’vans’,
and that ’vans’ are LOCATED in ’city’. However,
because ’vans’ is not an ACE markable, it cannot
participate in an annotated relationship. Therefore,
’troops’ is annotated as being LOCATED in ’city’,
which makes sense due to the transitivity of the rela-
tion LOCATED. In our approach, this leads to short-
est paths that pass through two or more predicate-
argument structures.
The last relation example is a case where there ex-
ist multiple shortest paths in the dependency graph
between the same two entities – there are actually
two different paths, with each path replicated into
three similar paths due to coordination. Our current
approach considers only one of the shortest paths,
nevertheless it seems reasonable to investigate using
all of them as multiple sources of evidence for rela-
tion extraction.
There may be cases where e1 and e2 belong
to predicate-argument structures that have no argu-
ment in common. However, because the depen-
dency graph is always connected, we are guaran-
teed to find a shortest path between the two enti-
ties. In general, we shall find a shortest sequence of
predicate-argument structures with target predicates
P1,P2,...,Pn such that e1 is an argument of P1, e2 is
an argument of Pn, and any two consecutive predi-
cates Pi and Pi+1 share a common argument (where
by “argument” we mean both arguments and com-
plements).
4 Learning with Dependency Paths
The shortest path between two entities in a depen-
dency graph offers a very condensed representation
of the information needed to assess their relation-
ship. A dependency path is represented as a se-
quence of words interspersed with arrows that in-
726
(1) He had no regrets for his actions in Brcko.
his→actions←in←Brcko
(2) U.S. troops today acted for the first time to capture an
alleged Bosnian war criminal, rushing from unmarked vans
parked in the northern Serb-dominated city of Bijeljina.
troops→rushing←from←vans→parked←in←city
(3) Jelisic created an atmosphere of terror at the camp by
killing, abusing and threatening the detainees.
detainees→killing←Jelisic→created←at←camp
detainees→abusing←Jelisic→created←at←camp
detainees→threatning←Jelisic→created←at←camp
detainees→killing→by→created←at←camp
detainees→abusing→by→created←at←camp
detainees→threatening→by→created←at←camp
Figure 2: Relation examples.
dicate the orientation of each dependency, as illus-
trated in Table 1. These paths however are com-
pletely lexicalized and consequently their perfor-
mance will be limited by data sparsity. We can al-
leviate this by categorizing words into classes with
varying degrees of generality, and then allowing
paths to use both words and their classes. Examples
of word classes are part-of-speech (POS) tags and
generalizations over POS tags such as Noun, Active
Verb or Passive Verb. The entity type is also used for
the two ends of the dependency path. Other poten-
tially useful classes might be created by associating
with each noun or verb a set of hypernyms corre-
sponding to their synsets in WordNet.
The set of features can then be defined as a
Cartesian product over these word classes, as illus-
trated in Figure 3 for the dependency path between
’protesters’ and ’station’ in sentence S1. In this rep-
resentation, sparse or contiguous subsequences of
nodes along the lexicalized dependency path (i.e.
path fragments) are included as features simply by
replacing the rest of the nodes with their correspond-
ing generalizations.
The total number of features generated by this de-
pendency path is 4×1×3×1×4, and some of them
are listed in Table 2.


protesters
NNS
Noun
PERSON

×[→]×
bracketleftBigg seized
VBD
Verb
bracketrightBigg
×[←]×


stations
NNS
Noun
FACILITY


Figure 3: Feature generation from dependency path.
protesters → seized ← stations
Noun → Verb ← Noun
PERSON → seized ← FACILITY
PERSON → Verb ← FACILITY
... (48 features)
Table 2: Sample Features.
For verbs and nouns (and their respective word
classes) occurring along a dependency path we also
use an additional suffix ’(-)’ to indicate a negative
polarity item. In the case of verbs, this suffix is used
when the verb (or an attached auxiliary) is modi-
fied by a negative polarity adverb such as ’not’ or
’never’. Nouns get the negative suffix whenever
they are modified by negative determiners such as
’no’, ’neither’ or ’nor’. For example, the phrase “He
never went to Paris” is associated with the depen-
dency path ’He → went(-) ← to ← Paris’.
Explicitly creating for each relation example a
vector with a position for each dependency path fea-
ture is infeasible, due to the high dimensionality of
the feature space. Here we can exploit dual learn-
ing algorithms that process examples only via com-
puting their dot-products, such as the Support Vec-
tor Machines (SVMs) (Vapnik, 1998; Cristianini
and Shawe-Taylor, 2000). These dot-products be-
tween feature vectors can be efficiently computed
through a kernel function, without iterating over all
the corresponding features. Given the kernel func-
tion, the SVM learner tries to find a hyperplane that
separates positive from negative examples and at the
same time maximizes the separation (margin) be-
tween them. This type of max-margin separator has
been shown both theoretically and empirically to re-
sist overfitting and to provide good generalization
performance on unseen examples.
Computing the dot-product (i.e. kernel) between
two relation examples amounts to calculating the
727
number of common features of the type illustrated
in Table 2. If x = x1x2...xm and y = y1y2...yn are
two relation examples, where xi denotes the set of
word classes corresponding to position i (as in Fig-
ure 3), then the number of common features between
x and y is computed as in Equation 1.
K(x,y) =
braceleftBigg
0, m negationslash= nproducttext
n
i=1 c(xi,yi), m = n
(1)
where c(xi,yi) = |xi∩yi| is the number of common
word classes between xi and yi.
This is a simple kernel, whose computation takes
O(n) time. If the two paths have different lengths,
they correspond to different ways of expressing a re-
lationship – for instance, they may pass through a
different number of predicate argument structures.
Consequently, the kernel is defined to be 0 in this
case. Otherwise, it is the product of the number of
common word classes at each position in the two
paths. As an example, let us consider two instances
of the LOCATED relationship:
1. ’his actions in Brcko’, and
2. ’his arrival in Beijing’.
Their corresponding dependency paths are:
1. ’his → actions ← in ← Brcko’, and
2. ’his → arrival ← in ← Beijing’.
Their representation as a sequence of sets of word
classes is given by:
1. x = [x1 x2 x3 x4 x5 x6 x7], where x1 =
{his, PRP, PERSON}, x2 = {→}, x3 = {actions,
NNS, Noun}, x4 = {←}, x5 = {in, IN}, x6 =
{←}, x7 = {Brcko, NNP, Noun, LOCATION}
2. y = [y1 y2 y3 y4 y5 y6 y7], where y1 = {his,
PRP, PERSON}, y2 = {→}, y3 = {arrival, NN,
Noun}, y4 = {←}, y5 = {in, IN}, y6 = {←}, y7
= {Beijing, NNP, Noun, LOCATION}
Based on the formula from Equation 1, the kernel is
computed as K(x,y) = 3×1×1×1×2×1×3 = 18.
We use this relation kernel in conjunction with
SVMs in order to find decision hyperplanes that best
separate positive examples from negative examples.
We modified the LibSVM1 package for SVM learn-
ing by plugging in the kernel described above, and
used its default one-against-one implementation for
multiclass classification.
5 Experimental Evaluation
We applied the shortest path dependency kernel to
the problem of extracting top-level relations from
the ACE corpus (NIST, 2000), the version used
for the September 2002 evaluation. The training
part of this dataset consists of 422 documents, with
a separate set of 97 documents allocated for test-
ing. This version of the ACE corpus contains three
types of annotations: coreference, named entities
and relations. Entities can be of the type PERSON,
ORGANIZATION, FACILITY, LOCATION, and GEO-
POLITICAL ENTITY. There are 5 general, top-level
relations: ROLE, PART, LOCATED, NEAR, and SO-
CIAL. The ROLE relation links people to an organi-
zation to which they belong, own, founded, or pro-
vide some service. The PART relation indicates sub-
set relationships, such as a state to a nation, or a sub-
sidiary to its parent company. The AT relation indi-
cates the location of a person or organization at some
location. The NEAR relation indicates the proxim-
ity of one location to another. The SOCIAL rela-
tion links two people in personal, familial or profes-
sional relationships. Each top-level relation type is
further subdivided into more fine-grained subtypes,
resulting in a total of 24 relation types. For exam-
ple, the LOCATED relation includes subtypes such
as LOCATED-AT, BASED-IN, and RESIDENCE. In
total, there are 7,646 intra-sentential relations, of
which 6,156 are in the training data and 1,490 in the
test data.
We assume that the entities and their labels are
known. All preprocessing steps – sentence segmen-
tation, tokenization, and POS tagging – were per-
formed using the OpenNLP2 package.
5.1 Extracting dependencies using a CCG
parser
CCG (Steedman, 2000) is a type-driven theory of
grammar where most language-specific aspects of
the grammar are specified into lexicon. To each lex-
1URL:http://www.csie.ntu.edu.tw/˜cjlin/libsvm/
2URL: http://opennlp.sourceforge.net
728
ical item corresponds a set of syntactic categories
specifying its valency and the directionality of its
arguments. For example, the words from the sen-
tence “protesters seized several stations” are mapped
in the lexicon to the following categories:
protesters : NP
seized : (S\NP)/NP
several : NP/NP
stations : NP
The transitive verb ’seized’ expects two arguments:
a noun phrase to the right (the object) and another
noun phrase to the left (the subject). Similarly, the
adjective ’several’ expects a noun phrase to its right.
Depending on whether its valency is greater than
zero or not, a syntactic category is called a functor
or an argument. In the example above, ’seized’ and
’several’ are functors, while ’protesters’ and ’sta-
tions’ are arguments.
Syntactic categories are combined using a small
set of typed combinatory rules such as functional ap-
plication, composition and type raising. In Table 3
we show a sample derivation based on three func-
tional applications.
protesters seized several stations
NP (S\NP)/NP NP/NP NP
NP (S\NP)/NP NP
NP S\NP
S
Table 3: Sample derivation.
In order to obtain CCG derivations for all sen-
tences in the ACE corpus, we used the CCG
parser introduced in (Hockenmaier and Steedman,
2002)3. This parser also outputs a list of dependen-
cies, with each dependency represented as a 4-tuple
〈f,a,wf,wa〉, where f is the syntactic category of
the functor, a is the argument number, wf is the head
word of the functor, and wa is the head word of the
argument. For example, the three functional appli-
cations from Table 3 result in the functor-argument
dependencies enumerated below in Table 4.
3URL:http://www.ircs.upenn.edu/˜juliahr/Parser/
f a wf wa
NP/NP 1 ’several’ ’stations’
(S\NP)/NP 2 ’seized’ ’stations’
(S\NP)/NP 1 ’seized’ ’protesters’
Table 4: Sample dependencies.
Because predicates (e.g. ’seized’) and adjuncts
(e.g. ’several’) are always represented as func-
tors, while complements (e.g. ’protesters’ and ’sta-
tions’) are always represented as arguments, it is
straightforward to transform a functor-argument de-
pendency into a head-modifier dependency. The
head-modifier dependencies corresponding to the
three functor-argument dependencies in Table 4 are:
’protesters → seized’, ’stations → seized’, and ’sev-
eral → stations’.
Special syntactic categories are assigned in CCG
to lexical items that project unbounded dependen-
cies, such as the relative pronouns ’who’, ’which’
and ’that’. Coupled with a head-passing mechanism,
these categories allow the extraction of long-range
dependencies. Together with the local word-word
dependencies, they create a directed acyclic depen-
dency graph for each parsed sentence, as shown in
Figure 1.
5.2 Extracting dependencies using a CFG
parser
Local dependencies can be extracted from a CFG
parse tree using simple heuristic rules for finding
the head child for each type of constituent. Alter-
natively, head-modifier dependencies can be directly
output by a parser whose model is based on lexical
dependencies. In our experiments, we used the full
parse output from Collins’ parser (Collins, 1997), in
which every non-terminal node is already annotated
with head information. Because local dependencies
assemble into a tree for each sentence, there is only
one (shortest) path between any two entities in a de-
pendency tree.
5.3 Experimental Results
A recent approach to extracting relations is de-
scribed in (Culotta and Sorensen, 2004). The au-
thors use a generalized version of the tree kernel
from (Zelenko et al., 2003) to compute a kernel over
729
relation examples, where a relation example consists
of the smallest dependency tree containing the two
entities of the relation. Precision and recall values
are reported for the task of extracting the 5 top-level
relations in the ACE corpus under two different sce-
narios:
– [S1] This is the classic setting: one multi-class
SVM is learned to discriminate among the 5 top-
level classes, plus one more class for the no-relation
cases.
– [S2] Because of the highly skewed data distribu-
tion, the recall of the SVM approach in the first sce-
nario is very low. In (Culotta and Sorensen, 2004)
the authors propose doing relation extraction in two
steps: first, one binary SVM is trained for rela-
tion detection, which means that all positive rela-
tion instances are combined into one class. Then the
thresholded output of this binary classifier is used as
training data for a second multi-class SVM, which is
trained for relation classification. The same kernel
is used in both stages.
We present in Table 5 the performance of our
shortest path (SP) dependency kernel on the task of
relation extraction from ACE, where the dependen-
cies are extracted using either a CCG parser (SP-
CCG), or a CFG parser (SP-CFG). We also show
the results presented in (Culotta and Sorensen, 2004)
for their best performing kernel K4 (a sum between
a bag-of-words kernel and their dependency kernel)
under both scenarios.
Method Precision Recall F-measure
(S1) SP-CCG 67.5 37.2 48.0
(S1) SP-CFG 71.1 39.2 50.5
(S1) K4 70.3 26.3 38.0
(S2) SP-CCG 63.7 41.4 50.2
(S2) SP-CFG 65.5 43.8 52.5
(S2) K4 67.1 35.0 45.8
Table 5: Extraction Performance on ACE.
The shortest-path dependency kernels outperform
the dependency kernel from (Culotta and Sorensen,
2004) in both scenarios, with a more significant dif-
ference for SP-CFG. An error analysis revealed that
Collins’ parser was better at capturing local depen-
dencies, hence the increased accuracy of SP-CFG.
Another advantage of our shortest-path dependency
kernels is that their training and testing are very fast
– this is due to representing the sentence as a chain
of dependencies on which a fast kernel can be com-
puted. All the four SP kernels from Table 5 take
between 2 and 3 hours to train and test on a 2.6GHz
Pentium IV machine.
To avoid numerical problems, we constrained the
dependency paths to pass through at most 10 words
(as observed in the training data) by setting the ker-
nel to 0 for longer paths. We also tried the alterna-
tive solution of normalizing the kernel, however this
led to a slight decrease in accuracy. Having longer
paths give larger kernel scores in the unnormalized
version does not pose a problem because, by defi-
nition, paths of different lengths correspond to dis-
joint sets of features. Consequently, the SVM algo-
rithm will induce lower weights for features occur-
ring in longer paths, resulting in a linear separator
that works irrespective of the size of the dependency
paths.
6 Related Work
In (Zelenko et al., 2003), the authors do relation
extraction using a tree kernel defined over shallow
parse tree representations of sentences. The same
tree kernel is slightly generalized in (Culotta and
Sorensen, 2004) and used in conjunction with de-
pendency trees. In both approaches, a relation in-
stance is defined to be the smallest subtree in the
parse or dependency tree that includes both entities.
In this paper we argued that the information relevant
to relation extraction is almost entirely concentrated
in the shortest path in the dependency tree, leading to
an even smaller representation. Another difference
between the tree kernels above and our new kernel
is that the tree kernels used for relation extraction
are opaque i.e. the semantics of the dimensions in
the corresponding Hilbert space is not obvious. For
the shortest-path kernels, the semantics is known by
definition: each path feature corresponds to a dimen-
sion in the Hilbert space. This transparency allows
us to easily restrict the types of patterns counted by
the kernel to types that we deem relevant for relation
extraction. The tree kernels are also more time con-
suming, especially in the sparse setting, where they
count sparse subsequences of children common to
nodes in the two trees. In (Zelenko et al., 2003), the
730
tree kernel is computed in O(mn) time, where m
and n are the number of nodes in the two trees. This
changes to O(mn3) in the sparse setting.
Our shortest-path intuition bears some similar-
ity with the underlying assumption of the relational
pathfinding algorithm from (Richards and Mooney,
1992) : “in most relational domains, important con-
cepts will be represented by a small number of fixed
paths among the constants defining a positive in-
stance – for example, the grandparent relation is de-
fined by a single fixed path consisting of two parent
relations.” We can see this happening also in the task
of relation extraction from ACE, where “important
concepts” are the 5 types of relations, and the “con-
stants” defining a positive instance are the 5 types of
entities.
7 Future Work
Local and non-local (deep) dependencies are equally
important for finding relations. In this paper we tried
extracting both types of dependencies using a CCG
parser, however another approach is to recover deep
dependencies from syntactic parses, as in (Camp-
bell, 2004; Levy and Manning, 2004). This may
have the advantage of preserving the quality of lo-
cal dependencies while completing the representa-
tion with non-local dependencies.
Currently, the method assumes that the named en-
tities are known. A natural extension is to automati-
cally extract both the entities and their relationships.
Recent research (Roth and Yih, 2004) indicates that
integrating entity recognition with relation extrac-
tion in a global model that captures the mutual influ-
ences between the two tasks can lead to significant
improvements in accuracy.
8 Conclusion
We have presented a new kernel for relation extrac-
tion based on the shortest-path between the two rela-
tion entities in the dependency graph. Comparative
experiments on extracting top-level relations from
the ACE corpus show significant improvements over
a recent dependency tree kernel.
9 Acknowledgements
This work was supported by grants IIS-0117308 and
IIS-0325116 from the NSF.
References
Richard Campbell. 2004. Using linguistic principles to recover
empty categories. In Proceedings of the 42nd Annual Meet-
ing of the Association for Computational Linguistics (ACL-
04), pages 645–652, Barcelona, Spain, July.
Michael J. Collins. 1997. Three generative, lexicalised mod-
els for statistical parsing. In Proceedings of the 35th An-
nual Meeting of the Association for Computational Linguis-
tics (ACL-97), pages 16–23.
Nello Cristianini and John Shawe-Taylor. 2000. An Introduc-
tion to Support Vector Machines and Other Kernel-based
Learning Methods. Cambridge University Press.
Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree
kernels for relation extraction. In Proceedings of the 42nd
Annual Meeting of the Association for Computational Lin-
guistics (ACL-04), Barcelona, Spain, July.
Ralph Grishman. 1995. Message Understanding Conference 6.
http://cs.nyu.edu/cs/faculty/grishman/muc6.html.
Julia Hockenmaier and Mark Steedman. 2002. Generative
models for statistical parsing with combinatory categorial
grammar. In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics (ACL-2002),
pages 335–342, Philadelphia, PA.
Roger Levy and Christopher Manning. 2004. Deep dependen-
cies from context-free statistical parsers: Correcting the sur-
face dependency approximation. In Proceedings of the 42nd
Annual Meeting of the Association for Computational Lin-
guistics (ACL-04), pages 327–334, Barcelona, Spain, July.
NIST. 2000. ACE – Automatic Content Extraction.
http://www.nist.gov/speech/tests/ace.
Soumya Ray and Mark Craven. 2001. Representing sentence
structure in hidden Markov models for information extrac-
tion. In Proceedings of the Seventeenth International Joint
Conference on Artificial Intelligence (IJCAI-2001), pages
1273–1279, Seattle, WA.
Bradley L. Richards and Raymond J. Mooney. 1992. Learning
relations by pathfinding. In Proceedings of the Tenth Na-
tional Conference on Artificial Intelligence (AAAI-92), pages
50–55, San Jose, CA, July.
D. Roth and W. Yih. 2004. A linear programming formula-
tion for global inference in natural language tasks. In Pro-
ceedings of the Annual Conference on Computational Natu-
ral Language Learning (CoNLL), pages 1–8, Boston, MA.
Mark Steedman. 2000. The Syntactic Process. The MIT Press,
Cambridge, MA.
Vladimir N. Vapnik. 1998. Statistical Learning Theory. John
Wiley & Sons.
D. Zelenko, C. Aone, and A. Richardella. 2003. Kernel meth-
ods for relation extraction. Journal of Machine Learning
Research, 3:1083–1106.
731
