Decomposition Kernels for Natural Language Processing
Fabrizio Costa Sauro Menchetti Alessio Ceroni
Dipartimento di Sistemi e Informatica,
Universit`a degli Studi di Firenze,
via di S. Marta 3, 50139 Firenze, Italy
{costa,menchett,passerini,aceroni,p-f} AT dsi.unifi.it
Andrea Passerini Paolo Frasconi
Abstract
We propose a simple solution to the se-
quence labeling problem based on an ex-
tension of weighted decomposition ker-
nels. We additionally introduce a multi-
instance kernel approach for representing
lexical word sense information. These
new ideas have been preliminarily tested
on named entity recognition and PP at-
tachment disambiguation. We finally sug-
gest how these techniques could be poten-
tially merged using a declarative formal-
ism that may provide a basis for the inte-
gration of multiple sources of information
when using kernel-based learning in NLP.
1 Introduction
Many tasks related to the analysis of natural lan-
guage are best solved today by machine learning
and other data driven approaches. In particular,
several subproblems related to information extrac-
tion can be formulated in the supervised learning
framework, where statistical learning has rapidly
become one of the preferred methods of choice.
A common characteristic of many NLP problems
is the relational and structured nature of the rep-
resentations that describe data and that are inter-
nally used by various algorithms. Hence, in or-
der to develop effective learning algorithms, it is
necessary to cope with the inherent structure that
characterize linguistic entities. Kernel methods
(see e.g. Shawe-Taylor and Cristianini, 2004) are
well suited to handle learning tasks in structured
domains as the statistical side of a learning algo-
rithm can be naturally decoupled from any rep-
resentational details that are handled by the ker-
nel function. As a matter of facts, kernel-based
statistical learning has gained substantial impor-
tance in the NLP field. Applications are numerous
and diverse and include for example refinement
of statistical parsers (Collins and Duffy, 2002),
tagging named entities (Cumby and Roth, 2003;
Tsochantaridis et al., 2004), syntactic chunking
(Daum´e III and Marcu, 2005), extraction of rela-
tions between entities (Zelenko et al., 2003; Cu-
lotta and Sorensen, 2004), semantic role label-
ing (Moschitti, 2004). The literature is rich with
examples of kernels on discrete data structures
such as sequences (Lodhi et al., 2002; Leslie et
al., 2002; Cortes et al., 2004), trees (Collins and
Duffy, 2002; Kashima and Koyanagi, 2002), and
annotated graphs (G¨artner, 2003; Smola and Kon-
dor, 2003; Kashima et al., 2003; Horv´ath et al.,
2004). Kernels of this kind can be almost in-
variably described as special cases of convolu-
tion and other decomposition kernels (Haussler,
1999). Thanks to its generality, decomposition
is an attractive and flexible approach for defining
the similarity between structured objects starting
from the similarity between smaller parts. How-
ever, excessively large feature spaces may result
from the combinatorial growth of the number of
distinct subparts with their size. When too many
dimensions in the feature space are irrelevant, the
Gram matrix will be nearly diagonal (Sch¨olkopf
et al., 2002), adversely affecting generalization in
spite of using large margin classifiers (Ben-David
et al., 2002). Possible cures include extensive use
of prior knowledge to guide the choice of rele-
vant parts (Cumby and Roth, 2003; Frasconi et al.,
2004), the use of feature selection (Suzuki et al.,
2004), and soft matches (Saunders et al., 2002). In
(Menchetti et al., 2005) we have shown that better
generalization can indeed be achieved by avoid-
ing hard comparisons between large parts. In a
17
weighteddecompositionkernel(WDK)onlysmall
parts are matched, whereas the importance of the
match is determined by comparing the sufficient
statistics of elementary probabilistic models fit-
ted on larger contextual substructures. Here we
introduce a position-dependent version of WDK
that can solve sequence labeling problems without
searchingtheoutputspace, asrequiredbyotherre-
cently proposed kernel-based solutions (Tsochan-
taridis et al., 2004; Daum´e III and Marcu, 2005).
The paper is organized as follows. In the next
two sections we briefly review decomposition ker-
nels and its weighted variant. In Section 4 we in-
troduce a version of WDK for solving supervised
sequence labeling tasks and report a preliminary
evaluation on a named entity recognition problem.
In Section 5 we suggest a novel multi-instance ap-
proach for representing WordNet information and
present an application to the PP attachment am-
biguity resolution problem. In Section 6 we dis-
cuss how these ideas could be merged using a
declarative formalism in order to integrate mul-
tiple sources of information when using kernel-
based learning in NLP.
2 Decomposition Kernels
An R-decomposition structure (Haussler, 1999;
Shawe-Taylor and Cristianini, 2004) on a set X is
a triple R = 〈 vectorX,R,vectork〉 where vectorX = (X1,...,XD)
is a D–tuple of non–empty subsets of X, R is
a finite relation on X1 × ··· × XD × X, and
vectork = (k1,...,kD) is a D–tuple of positive defi-
nite kernel functions kd : Xd ×Xd mapsto→ IR. R(vectorx,x)
is true iff vectorx is a tuple of “parts” for x — i.e. vectorx
is a decomposition of x. Note that this defini-
tion of “parts” is very general and does not re-
quire the parthood relation to obey any specific
mereological axioms, such as those that will be
introduced in Section 6. For any x ∈ X, let
R−1(x) = {(x1,...,xD) ∈ vectorX : R(vectorx,x)} de-
note the multiset of all possible decompositions1
of x. A decomposition kernel is then defined as
the multiset kernel between the decompositions:
KR(x,xprime) =
summationdisplay
vectorx ∈ R−1(x)
vectorxprime ∈ R−1(xprime)
Dproductdisplay
d=1
κd(xd,xprimed) (1)
1Decomposition examples in the string domain include
taking all the contiguous fixed-length substrings or all the
possible ways of dividing a string into two contiguous sub-
strings.
where, as an alternative way of combining the ker-
nels, we can use the product instead of a summa-
tion: intuitively this increases the feature space di-
mension and makes the similarity measure more
selective. Since decomposition kernels form a
rather vast class, the relation R needs to be care-
fully tuned to different applications in order to
characterize a suitable kernel. As discussed in
the Introduction, however, taking all possible sub-
partsintoaccountmayleadtopoorpredictivitybe-
cause of the combinatorial explosion of the feature
space.
3 Weighted Decomposition Kernels
A weighted decomposition kernel (WDK) is char-
acterized by the following decomposition struc-
ture:
R = 〈 vectorX,R,(δ,κ1,...,κD)〉
where vectorX = (S,Z1,...,ZD), R(s,z1,...,zD,x)
istrue iffs ∈ S isa subpartofxcalledthe selector
andvectorz = (z1,...,zD) ∈ Z1×···×ZD isatupleof
subparts of x called the contexts of s in x. Precise
definitions of s and vectorz are domain-dependent. For
examplein(Menchettietal., 2005)wepresenttwo
formulations, one for comparing whole sequences
(where both the selector and the context are subse-
quences), and one for comparing attributed graphs
(where the selector is a single vertex and the con-
text is the subgraph reachable from the selector
within a short path). The definition is completed
by introducing a kernel on selectors and a kernel
on contexts. The former can be chosen to be the
exact matching kernel, δ, on S × S, defined as
δ(s,sprime) = 1 if s = sprime and δ(s,sprime) = 0 otherwise.
The latter, κd, is a kernel on Zd × Zd and pro-
vides a soft similarity measure based on attribute
frequencies. Several options are available for con-
textkernels,includingthediscreteversionofprob-
ability product kernels (PPK) (Jebara et al., 2004)
and histogram intersection kernels (HIK) (Odone
et al., 2005). Assuming there are n categorical
attributes, each taking on mi distinct values, the
context kernel can be defined as:
κd(z,zprime) =
nsummationdisplay
i=1
ki(z,zprime) (2)
whereki isakernelonthei-thattribute. Denoteby
pi(j) the observed frequency of value j in z. Then
18
ki can be defined as a HIK or a PPK respectively:
ki(z,zprime) =
misummationdisplay
j=1
min{pi(j),pprimei(j)} (3)
ki(z,zprime) =
misummationdisplay
j=1
radicalBig
pi(j)·pprimei(j) (4)
This setting results in the following general form
of the kernel:
K(x,xprime) =
summationdisplay
(s,vectorz) ∈ R−1(x)
(sprime,vectorzprime) ∈ R−1(xprime)
δ(s,sprime)
Dsummationdisplay
d=1
κd(zd,zprimed) (5)
where we can replace the summation of kernels
withproducttextDd=1 1 + κd(zd,zprimed).
Comparedtokernelsthatsimplycountthenum-
ber of substructures, the above function weights
different matches between selectors according to
contextual information. The kernel can be after-
wards normalized in [−1,1] to prevent similarity
to be boosted by the mere size of the structures
being compared.
4 WDK for sequence labeling and
applications to NER
In a sequence labeling task we want to map input
sequences to output sequences, or, more precisely,
we want to map each element of an input sequence
that takes label from a source alphabet to an ele-
ment with label in a destination alphabet.
Here we cast the sequence labeling task into
position specific classification, where different se-
quence positions give independent examples. This
is different from previous approaches in the lit-
erature where the sequence labeling problem is
solved by searching in the output space (Tsochan-
taridis et al., 2004; Daum´e III and Marcu, 2005).
Although the method lacks the potential for col-
lectively labeling all positions simultaneously, it
results in a much more efficient algorithm.
In the remainder of the section we introduce
a specialized version of the weighted decompo-
sition kernel suitable for a sequence transduction
task originating in the natural language process-
ing domain: the named entity recognition (NER)
problem, where we map sentences to sequences of
a reduced number of named entities (see Sec.4.1).
More formally, given a finite dictionary Σ of
words and an input sentence x ∈ Σ∗, our input ob-
jects are pairs of sentences and indices r = (x,t)
Figure 1: Sentence decomposition.
where r ∈ Σ∗ × IN. Given a sentence x, two in-
tegers b ≥ 1 and b ≤ e ≤ |x|, let x[b] denote the
word at position b and x[b..e] the sub-sequence of
x spanning positions from b to e. Finally we will
denote by ξ(x[b]) a word attribute such as a mor-
phological trait (is a number or has capital initial,
see 4.1) for the word in sentence x at position b.
We introduce two versions of WDK: one with
four context types (D = 4) and one with in-
creased contextual information (D = 6) (see
Eq. 5). The relation R depends on two integers
t and i ∈ {1,...,|x|}, where t indicates the po-
sition of the word we want to classify and i the
position of a generic word in the sentence. The
relation for the first kernel version is defined as:
R = {(s,zLL,zLR,zRL,zRR,r)} such that the
selector s = x[i] is the word at position i, the zLL
(LeftLeft) part is a sequence defined as x[1..i] if
i < t or the null sequence ε otherwise and the
zLR (LeftRight) part is the sequence x[i + 1..t] if
i < t or ε otherwise. Informally, zLL is the initial
portion of the sentence up to word of position i,
and zLR is the portion of the sentence from word
at position i + 1 up to t (see Fig. 1). Note that
when we are dealing with a word that lies to the
left of the target word t, its zRL and zRR parts are
empty. Symmetrical definitions hold for zRL and
zRR when i > t. We define the weighted decom-
position kernel for sequences as
K(r,rprime)=
|x|summationdisplay
t=1
|xprime|summationdisplay
tprime=1
δξ(s,sprime)
summationdisplay
d∈{LL,LR,RL,RR}
κ(zd,zprimed) (6)
where δξ(s,sprime) = 1 if ξ(s) = ξ(sprime) and 0 oth-
erwise (that is δξ checks whether the two selector
words have the same morphological trait) and κ
is Eq. 2 with only one attribute which then boils
downtoEq.3orEq.4, thatisakerneloverthehis-
togram for word occurrences over a specific part.
Intuitively, when applied to word sequences,
this kernel considers separately words to the left
19
of the entry we want to transduce and those to
its right. The kernel computes the similarity for
each sub-sequence by matching the corresponding
bag of enriched words: each word is matched only
with words that have the same trait as extracted by
ξ and the match is then weighted proportionally to
the frequency count of identical words preceding
and following it.
The kernel version with D=6 adds two parts
called zLO (LeftOther) and zRO (RightOther) de-
fined as x[t+1..|r|] and x[1..t] respectively; these
representtheremainingsequencepartssothatx =
zLL ◦zLR ◦zLO and x = zRL ◦zRR ◦zRO.
Note that the WDK transforms the sentence
in a bag of enriched words computed in a pre-
processing phase thus achieving a significant re-
ductionincomputationalcomplexity(comparedto
the recursive procedure in (Lodhi et al., 2002)).
4.1 Named Entity Recognition Experimental
Results
Named entities are phrases that contain the names
of persons, organizations, locations, times and
quantities. For example in the following sentence:
[PER Wolff ] , currently a journalist in [LOC
Argentina ] , played with [PER Del Bosque ] in the
final years of the seventies in [ORG Real Madrid].
we are interested in predicting that Wolff and Del
Bosque are people’s names, that Argentina is a
name of a location and that Real Madrid is a name
of an organization.
The chosen dataset is provided by the shared
task of CoNLL–2002 (Saunders et al., 2002)
which concerns language–independent named en-
tity recognition. There are four types of phrases:
person names (PER), organizations (ORG), loca-
tions (LOC) and miscellaneous names (MISC),
combined with two tags, B to denote the first item
ofaphraseandIforanynon–initialword; allother
phrases are classified as (OTHER). Of the two
available languages (Spanish and Dutch), we run
experimentsonlyontheSpanishdatasetwhichisa
collection of news wire articles made available by
the Spanish EFE News Agency. We select a sub-
set of 300 sentences for training and we evaluate
the performance on test set. For each category, we
evaluate the Fβ=1 measure of 4 versions of WDK:
word histograms are matched using HIK (Eq. 3)
and the kernels on various parts (zLL,zLR,etc) are
combined with a summation SUMHIK or product
PROHIK; alternatively the histograms are com-
Table 1: NER experiment D=4
CLASS SUMHIS PROHIS SUMPRO PROPRO
B-LOC 74.33 68.68 72.12 66.47
I-LOC 58.18 52.76 59.24 52.62
B-MISC 52.77 43.31 46.86 39.00
I-MISC 79.98 80.15 77.85 79.65
B-ORG 69.00 66.87 68.42 67.52
I-ORG 76.25 75.30 75.12 74.76
B-PER 60.11 56.60 59.33 54.80
I-PER 65.71 63.39 65.67 60.98
MICRO Fβ=1 69.28 66.33 68.03 65.30
Table 2: NER experiment with D=6
CLASS SUMHIS PROHIS SUMPRO PROPRO
B-LOC 74.81 73.30 73.65 73.69
I-LOC 57.28 58.87 57.76 59.44
B-MISC 56.54 64.11 57.72 62.11
I-MISC 78.74 84.23 79.27 83.04
B-ORG 70.80 73.02 70.48 73.10
I-ORG 76.17 78.70 74.26 77.51
B-PER 66.25 66.84 66.04 67.46
I-PER 68.06 71.81 69.55 69.55
MICRO Fβ=1 70.69 72.90 70.32 72.38
bined with a PPK (Eq. 4) obtaining SUMPPK,
PROPPK.
The word attribute considered for the selector
is a word morphologic trait that classifies a word
in one of five possible categories: normal word,
number, all capital letters, only capital initial and
contains non alphabetic characters, while the con-
text histograms are computed counting the exact
word frequencies.
Results reported in Tab. 1 and Tab. 2 show that
performance is mildly affected by the different
choicesonhowtocombineinformationonthevar-
iouscontexts, thoughitseemsclearthatincreasing
contextual information has a positive influence.
Note that interesting preliminary results can be
obtained even without the use of any refined lan-
guage knowledge, such as part of speech tagging
or shallow/deep parsing.
5 Kernels for word semantic ambiguity
Parsing a natural language sentence often involves
the choice between different syntax structures that
are equally admissible in the given grammar. One
of the most studied ambiguity arise when deciding
between attaching a prepositional phrase either to
the noun phrase or to the verb phrase. An example
could be:
1. eat salad with forks (attach to verb)
2. eat salad with tomatoes (attach to noun)
20
The resolution of such ambiguities is usually per-
formed by the human reader using its past expe-
riences and the knowledge of the words mean-
ing. Machine learning can simulate human experi-
ence by using corpora of disambiguated phrases to
compute a decision on new cases. However, given
the number of different words that are currently
used in texts, there would never be a sufficient
dataset from which to learn. Adding semantic in-
formation on the possible word meanings would
permitthelearningofrulesthatapplytoentirecat-
egories and can be generalized to all the member
words.
5.1 Adding Semantic with WordNet
WordNet (Fellbaum, 1998) is an electronic lexi-
cal database of English words built and annotated
by linguistic researchers. WordNet is an exten-
sive and reliable source of semantic information
that can be used to enrich the representation of a
word. Each word is represented in the database by
a group of synonym sets (synset), with each synset
corresponding to an individual linguistic concept.
AllthesynsetscontainedinWordNetarelinkedby
relations of various types. An important relation
connects a synset to its hypernyms, that are its im-
mediately broader concepts. The hypernym (and
its opposite hyponym) relation defines a semantic
hierarchy of synsets that can be represented as a
directed acyclic graph. The different lexical cat-
egories (verbs, nouns, adjectives and adverbs) are
contained in distinct hierarchies and each one is
rooted by many synsets.
Several metrics have been devised to compute
a similarity score between two words using Word-
Net. In the following we resort to a multiset ver-
sion of the proximity measure used in (Siolas and
d’Alche Buc, 2000), though more refined alterna-
tives are also possible (for example using the con-
ceptual density as in (Basili et al., 2005)). Given
theacyclicnatureofthesemantichierarchies, each
synset can be represented by a group of paths that
followsthehypernymrelationsandfinishinoneof
thetoplevelconcepts. Twopathscanthenbecom-
pared by counting how many steps from the roots
they have in common. This number must then be
normalizeddividingbythesquarerootoftheprod-
uct between the path lengths. In this way one can
accounts for the unbalancing that arise from dif-
ferent parts of the hierarchies being differently de-
tailed. Given two paths pi and piprime, let l and lprime be
their lengths and n be the size of their common
part, the resulting kernel is:
k(pi,piprime) = n√l ·lprime (7)
The demonstration that k is positive definite arise
from the fact that n can be computed as a posi-
tive kernel k∗ by summing the exact match ker-
nels between the corresponding positions in pi and
piprime seen as sequences of synset identifiers. The
lengths l and lprime can then be evaluated as k∗(pi,pi)
and k∗(piprime,piprime) and k is the resulting normalized
version of k∗.
The kernel between two synsets σ and σprime can
then be computed by the multi-set kernel (G¨artner
et al., 2002a) between their corresponding paths.
Synsets are organized into forty-five lexicogra-
pher files based on syntactic category and logical
groupings. Additional information can be derived
by comparing the identifiers λ and λprime of these file
associated to σ and σprime. The resulting synset kernel
is:
κσ(σ,σprime) = δ(λ,λprime) +
summationdisplay
pi∈Π
summationdisplay
piprime∈Πprime
k(pi,piprime) (8)
where Π is the set of paths originating from σ and
the exact match kernel δ(λ,λprime) is 1 if λ ≡ λprime and
0 otherwise. Finally, the kernel κω between two
words is itself a multi-set kernel between the cor-
responding sets of synsets:
κω(ω,ωprime) =
summationdisplay
σ∈Σ
summationdisplay
σprime∈Σprime
κσ(σ,σprime) (9)
where Σ are the synsets associated to the word ω.
5.2 PP Attachment Experimental Results
The experiments have been performed using the
Wall-Street Journal dataset described in (Ratna-
parkhi et al., 1994). This dataset contains 20,800
training examples and 3,097 testing examples.
Each phrase x in the dataset is reduced to a verb
xv, its object noun xn1 and prepositional phrase
formed by a preposition xp and a noun xn2. The
target is either V or N whether the phrase is at-
tachedtotheverborthenoun. Datahavebeenpre-
processed by assigning to all the words their cor-
responding synsets. Additional meanings derived
from specific synsets have been attached to the
words as described in (Stetina and Nagao, 1997).
The kernel between two phrases x and xprime is then
computed by combining the kernels between sin-
gle words using either the sum or the product.
21
Method Acc Pre Rec
S 84.6% ± 0.65% 90.8% 82.2%
P 84.8% ± 0.65% 92.2% 81.0%
SW 85.4% ± 0.64% 90.9% 83.6%
SWL 85.3% ± 0.64% 91.1% 83.2%
PW 85.9% ± 0.63% 92.2% 83.1%
PWL 86.2% ± 0.62% 92.1% 83.7%
Table 3: Summary of the experimental results on
the PP attachment problem for various kernel pa-
rameters.
ResultsoftheexperimentsarereportedinTab.3
for various kernels parameters: S or P denote if
the sum or product of the kernels between words
are used, W denotes that WordNet semantic infor-
mationisadded(otherwisethekernelbetweentwo
wordsisjusttheexactmatchkernel)andLdenotes
that lexicographer files identifiers are used. An ad-
ditionalgaussiankernelisusedontopofKpp. The
C and γ parameters are selected using an inde-
pendent validation set. For each setting, accuracy,
precision and recall values on the test data are re-
ported, along with the standard deviation of the es-
timatedbinomialdistributionoferrors. Theresults
demonstrate that semantic information can help in
resolving PP ambiguities. A small difference ex-
ists between taking the product instead of the sum
of word kernels, and an additional increase in the
amount of information available to the learner is
given by the use of lexicographer files identifiers.
6 Using declarative knowledge for NLP
kernel integration
Data objects in NLP often require complex repre-
sentations; suffice it to say that a sentence is nat-
urally represented as a variable length sequence
of word tokens and that shallow/deep parsers are
reliably used to enrich these representations with
links between words to form parse trees. Finally,
additional complexity can be introduced by in-
cluding semantic information. Various facets of
this richness of representations have been exten-
sively investigated, including the expressiveness
of various grammar formalisms, the exploitation
of lexical representation (e.g. verb subcategoriza-
tion, semantic tagging), and the use of machine
readable sources of generic or specialized knowl-
edge (dictionaries, thesauri, domain specific on-
tologies). All these efforts are capable to address
language specific sub-problems but their integra-
tion into a coherent framework is a difficult feat.
Recent ideas for constructing kernel functions
starting from logical representations may offer an
appealing solution. G¨artner et al. (2002) have pro-
posed a framework for defining kernels on higher-
order logic individuals. Cumby and Roth (2003)
used description logics to represent knowledge
jointlywithpropositionalizationfordefiningaker-
nel function. Frasconi et al. (2004) proposed
kernels for handling supervised learning in a set-
tingsimilartothatofinductivelogicprogramming
where data is represented as a collection of facts
and background knowledge by a declarative pro-
gram in first-order logic. In this section, we briefly
reviewthisapproachandsuggestapossiblewayof
exploiting it for the integration of different sources
of knowledge that may be available in NLP.
6.1 Declarative Kernels
The definition of decomposition kernels as re-
ported in Section 2 is very general and covers al-
most all kernels for discrete structured data de-
veloped in the literature so far. Different kernels
are designed by defining the relation decompos-
ing an example into its “parts”, and specifying
kernels for individual parts. In (Frasconi et al.,
2004) we proposed a systematic approach to such
design, consisting in formally defining a relation
by the set of axioms it must satisfy. We relied
on mereotopology (Varzi, 1996) (i.e. the theory
of parts and places) in order to give a formal def-
inition of the intuitive concepts of parthood and
connection. The formalization of mereotopolog-
ical relations allows to automatically deduce in-
stances of such relations on the data, by exploit-
ing the background knowledge which is typically
available on structured domains. In (Frasconi et
al., 2004) we introduced declarative kernels (DK)
as a set of kernels working on mereotopological
relations, such as that of proper parthood (≺P) or
more complex relations based on connected parts.
A typed syntax for objects was introduced in order
to provide additional flexibility in defining kernels
on instances of the given relation. A basic kernel
on parts KP was defined as follows:
KP(x,xprime)=
summationdisplay
s≺P x
sprime≺P xprime
δT(s,sprime)parenleftbigκ(s,sprime)+KP(s,sprime)parenrightbig(10)
where δT matches objects of the same type and κ
is a kernel over object attributes.
22
Declarative kernels were tested in (Frasconi et
al., 2004) on a number of domains with promising
results, includingabiomedicalinformationextrac-
tion task (Goadrich et al., 2004) aimed at detecting
protein-localization relationships within Medline
abstracts. A DK on parts as the one defined in
Eq. (10) outperformed state-of-the-art ILP-based
systems Aleph and Gleaner (Goadrich etal., 2004)
in such information extraction task, and required
about three orders of magnitude less training time.
6.2 Weighted Decomposition Declarative
Kernels
Declarative kernels can be combined with WDK
in a rather straightforward way, thus taking the ad-
vantages of both methods. A simple approach is
that of using proper parthood in place of selec-
tors, and topology to recover the context of each
proper part. A weighted decomposition declara-
tive kernel (WD2K) of this kind would be defined
as in Eq. (10) simply adding to the attribute ker-
nel κ a context kernel that compares the surround-
ing of a pair of objects—as defined by the topol-
ogyrelation—usingsomeaggregatekernelsuchas
PPK or HIK (see Section 3). Note that such defini-
tion extends WDK by adding recursion to the con-
cept of comparison by selector, and DK by adding
contextstothekernelbetweenparts. Multiplecon-
textscanbeeasilyintroducedbyemployingdiffer-
ent notions of topology, provided they are consis-
tent with mereotopological axioms. As an exam-
ple, if objects are words in a textual document, we
can define l-connection as the relation for which
two words are l-connected if there are consequen-
tial within the text with at most l words in be-
tween, and obtain growingly large contexts by in-
creasing l. Moreover, an extended representation
of words, as the one employing WordNet semantic
information, could be easily plugged in by includ-
ing a kernel for synsets such as that in Section 5.1
into the kernel κ on word attributes. Additional
relations could be easily formalized in order to ex-
ploit specific linguisitc knowledge: a causal rela-
tion would allow to distinguish between preceding
and following context so to take into consideration
word order; an underlap relation, associating two
objects being parts of the same super-object (i.e.
pre-terminalsdominatedbythesamenon-terminal
node), would be able to express commanding no-
tions.
The promising results obtained with declarative
kernels (where only very simple lexical informa-
tion was used) together with the declarative ease
to integrate arbitrary kernels on specific parts are
all encouraging signs that boost our confidence in
this line of research.

References
Roberto Basili, Marco Cammisa, and Alessandro Mos-
chitti. 2005. Effective use of wordnet seman-
tics via kernel-based learning. In 9th Conference
on Computational Natural Language Learning, Ann
Arbor(MI), USA.
S. Ben-David, N. Eiron, and H. U. Simon. 2002. Lim-
itations of learning via embeddings in euclidean half
spaces. J. of Mach. Learning Research, 3:441–461.
M. Collins and N. Duffy. 2002. New ranking algo-
rithms for parsing and tagging: Kernels over dis-
crete structures, and the voted perceptron. In Pro-
ceedings of the Fortieth Annual Meeting on Associa-
tion for Computational Linguistics, pages 263–270,
Philadelphia, PA, USA.
C. Cortes, P. Haffner, and M. Mohri. 2004. Ratio-
nal kernels: Theory and algorithms. J. of Machine
Learning Research, 5:1035–1062.
A. Culotta and J. Sorensen. 2004. Dependency tree
kernels for relation extraction. In Proc. of the 42nd
Annual Meeting of the Association for Computa-
tional Linguistics, pages 423–429.
C. M. Cumby and D. Roth. 2003. On kernel meth-
ods for relational learning. In Proc. Int. Conference
on Machine Learning (ICML’03), pages 107–114,
Washington, DC, USA.
H. Daum´e III and D. Marcu. 2005. Learning as search
optimization: Approximate large margin methods
for structured prediction. In International Confer-
ence on Machine Learning (ICML), pages 169–176,
Bonn, Germany.
C. Fellbaum, editor. 1998. WordNet: An Electronic
Lexical Database. The MIT Press.
P. Frasconi, S. Muggleton, H. Lodhi, and A. Passerini.
2004. Declarative kernels. Technical Report RT
2/2004, Universit`a di Firenze.
T. G¨artner, P. A. Flach, A. Kowalczyk, and A. J. Smola.
2002a. Multi-instance kernels. In C. Sammut and
A. Hoffmann, editors, Proceedings of the 19th In-
ternational Conference on Machine Learning, pages
179–186. Morgan Kaufmann.
T. G¨artner, J.W. Lloyd, and P.A. Flach. 2002b. Ker-
nels for structured data. In S. Matwin and C. Sam-
mut, editors, Proceedings of the 12th International
Conference on Inductive Logic Programming, vol-
ume 2583 of Lecture Notes in Artificial Intelligence,
pages 66–83. Springer-Verlag.
T. G¨artner. 2003. A survey of kernels for structured
data. SIGKDD Explorations Newsletter, 5(1):49–
58.
M. Goadrich, L. Oliphant, and J. W. Shavlik. 2004.
Learning ensembles of first-order clauses for recall-
precision curves: A case study in biomedical infor-
mation extraction. In Proc. 14th Int. Conf. on Induc-
tive Logic Programming, ILP ’04, pages 98–115.
D. Haussler. 1999. Convolution kernels on discrete
structures. Technical Report UCSC-CRL-99-10,
University of California, Santa Cruz.
T. Horv´ath, T. G¨artner, and S. Wrobel. 2004. Cyclic
pattern kernels for predictive graph mining. In Pro-
ceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Min-
ing, pages 158–167. ACM Press.
T. Jebara, R. Kondor, and A. Howard. 2004. Proba-
bility product kernels. J. Mach. Learn. Res., 5:819–
844.
H. Kashima and T. Koyanagi. 2002. Kernels for
Semi–Structured Data. In Proceedings of the Nine-
teenth International Conference on Machine Learn-
ing, pages 291–298.
H. Kashima, K. Tsuda, and A. Inokuchi. 2003.
Marginalized kernels between labeled graphs. In
Proceedings of the Twentieth International Confer-
ence on Machine Learning, pages 321–328, Wash-
ington, DC, USA.
C. S. Leslie, E. Eskin, and W. S. Noble. 2002. The
spectrum kernel: A string kernel for SVM protein
classification. In Pacific Symposium on Biocomput-
ing, pages 566–575.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristian-
ini, and C. Watkins. 2002. Text classification us-
ing string kernels. Journal of Machine Learning Re-
search, 2:419–444.
S. Menchetti, F. Costa, and P. Frasconi. 2005.
Weighted decomposition kernels. In Proceedings of
the Twenty-second International Conference on Ma-
chine Learning, pages 585–592, Bonn, Germany.
Alessandro Moschitti. 2004. A study on convolution
kernels for shallow semantic parsing. In 42-th Con-
ference on Association for Computational Linguis-
tic, Barcelona, Spain.
F. Odone, A. Barla, and A. Verri. 2005. Building ker-
nels from binary strings for image matching. IEEE
Transactions on Image Processing, 14(2):169–180.
A Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A
maximum entropy model for prepositional phrase
attachment. In Proceedings of the ARPA Human
Language Technology Workshop, pages 250–255,
Plainsboro, NJ.
C. Saunders, H. Tschach, and J. Shawe-Taylor. 2002.
Syllables and other string kernel extensions. In Pro-
ceedings of the Nineteenth International Conference
on Machine Learning, pages 530–537.
B. Sch¨olkopf, J. Weston, E. Eskin, C. S. Leslie, and
W. S. Noble. 2002. A kernel approach for learn-
ing from almost orthogonal patterns. In Proc. of
ECML’02, pages 511–528.
J. Shawe-Taylor and N. Cristianini. 2004. Kernel
Methods for Pattern Analysis. Cambridge Univer-
sity Press.
G. Siolas and F. d’Alche Buc. 2000. Support vector
machines based on a semantic kernel for text cate-
gorization. In Proceedings of the IEEE-INNS-ENNS
International Joint Conference on Neural Networks,
volume 5, pages 205 – 209.
A.J. Smola and R. Kondor. 2003. Kernels and regular-
ization on graphs. In B. Sch¨olkopf and M.K. War-
muth, editors, 16th Annual Conference on Compu-
tational Learning Theory and 7th Kernel Workshop,
COLT/Kernel 2003, volume 2777 of Lecture Notes
in Computer Science, pages 144–158. Springer.
J Stetina and M Nagao. 1997. Corpus based pp attach-
ment ambiguity resolution with a semantic dictio-
nary. In Proceedings of the Fifth Workshop on Very
Large Corpora, pages 66–80, Beijing, China.
J. Suzuki, H. Isozaki, and E. Maeda. 2004. Convo-
lution kernels with feature selection for natural lan-
guage processing tasks. In Proc. of the 42nd Annual
Meeting of the Association for Computational Lin-
guistics, pages 119–126.
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Al-
tun. 2004. Support vector machine learning for in-
terdependent and structured output spaces. In Proc.
21st Int. Conf. on Machine Learning, pages 823–
830, Banff, Alberta, Canada.
A.C. Varzi. 1996. Parts, wholes, and part-whole re-
lations: the prospects of mereotopology. Data and
Knowledge Engineering, 20:259–286.
D. Zelenko, C. Aone, and A. Richardella. 2003. Ker-
nel methods for relation extraction. Journal of Ma-
chine Learning Research, 3:1083–1106.
