Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 1–8, Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Effective use of WordNet semantics via kernel-based learning
Roberto Basili and Marco Cammisa and Alessandro Moschitti
Department of Computer Science
University of Rome "Tor Vergata", Rome, Italy
{basili,cammisa,moschitti}@info.uniroma2.it
Abstract
Research on document similarity has shown that complex representations are not more accurate than the simple bag-of-words. Term clustering, e.g. using latent semantic indexing, word co-occurrences or synonym relations from a word ontology, has been shown to be not very effective. In particular, when external prior knowledge, e.g. WordNet, is used to extend the similarity function, the retrieval system's performance decreases. The critical issues here are the methods and conditions under which such knowledge is integrated.
In this paper we propose kernel functions to add prior knowledge to learning algorithms for document classification. Such kernels use a term similarity measure based on the WordNet hierarchy. The kernel trick is used to implement such a space in a balanced and statistically coherent way. Cross-validation results show the benefit of the approach for Support Vector Machines when little training data is available.
1 Introduction
The large literature on term clustering, term sim-
ilarity and weighting schemes shows that docu-
ment similarity is a central topic in Information Re-
trieval (IR). The research efforts have mostly been
directed at enriching the document representation
by using clustering (term generalization) or adding
compounds (term specifications). These studies are
based on the assumption that the similarity between
two documents can be expressed as the similarity be-
tween pairs of matching terms. Following this idea,
term clustering methods based on corpus term dis-
tributions or on external prior knowledge (e.g. pro-
vided by WordNet) were used to improve the basic
term matching.
An example of statistical clustering is given in
(Bekkerman et al., 2001). A feature selection tech-
nique, which clusters similar features/words, called
the Information Bottleneck (IB), was applied to Text
Categorization (TC). Such a cluster-based representation outperformed the simple bag-of-words on only one of the three collections tested. The effective use of external prior knowledge is even more difficult, since no attempt has yet succeeded in improving document retrieval or text classification accuracy (e.g. see (Smeaton, 1999; Sussna, 1993; Voorhees, 1993; Voorhees, 1994; Moschitti and Basili, 2004)).
The main problem of term cluster-based representations seems to be the unclear nature of the relationship between the word and the cluster information levels. Even if (semantic) clusters tend to improve the
system recall, simple terms are, on a large scale,
more accurate (e.g. (Moschitti and Basili, 2004)).
To overcome this problem, hybrid spaces containing terms and clusters have been experimented with (e.g. (Scott and Matwin, 1999)) but the results, again, showed
that the mixed statistical distributions of clusters and
terms impact either marginally or even negatively on
the overall accuracy.
In (Voorhees, 1993; Smeaton, 1999), clusters of
synonymous terms as defined in WordNet (WN)
(Fellbaum, 1998) were used for document retrieval.
The results showed that the misleading information
due to the wrong choice of the local term senses
causes the overall accuracy to decrease. Word sense
disambiguation (WSD) was thus applied beforehand
by indexing the documents by means of disam-
biguated senses, i.e. synset codes (Smeaton, 1999;
Sussna, 1993; Voorhees, 1993; Voorhees, 1994;
Moschitti and Basili, 2004). However, even the
state-of-the-art methods for WSD did not improve
the accuracy because of the inherent noise intro-
duced by the disambiguation mistakes. The above
studies suggest that term clusters decrease the precision of the system, as they force weakly related or unrelated (in the case of disambiguation errors) terms to contribute to the similarity function. The
successful introduction of prior external knowledge
relies on the solution of the above problem.
In this paper, we propose a model that introduces the semantic lexical knowledge contained in the WN hierarchy into a supervised text classification task. Intuitively, the main idea is that a document $d$ is represented through the set of all pairs $\langle t, t' \rangle \in V \times V$ in the vocabulary, originated by the terms $t \in d$ and all the words $t' \in V$, e.g. the
WN nouns. When the similarity between two docu-
ments is evaluated, their matching pairs are used to
account for the final score. The weight given to each
term pair is proportional to the similarity that the two
terms have in WN. Thus, the term t of the first docu-
ment contributes to the document similarity accord-
ing to its relatedness with any of the terms of the
second document and the prior external knowledge,
provided by WN, quantifies the single term to term
relatedness. This approach has two advantages: (a) we obtain a well-defined space which supports the similarity between terms with different surface forms, based on external knowledge, and (b) we avoid explicitly defining term or sense clusters, which inevitably introduce noise.
The class of spaces which embeds the above pair information may be composed of $O(|V|^2)$ dimensions. If we consider only the WN nouns (about $10^5$), our space contains about $10^{10}$ dimensions, which is not manageable by most learning algorithms. Kernel methods can solve this problem as they allow us to use an implicit space representation in the learning algorithms. Among them, Support Vector Machines (SVMs) (Vapnik, 1995) are kernel-based learners which achieve high accuracy in the presence of many irrelevant features. This is another important property, as the selection of the informative pairs is left to the SVM learning.
Moreover, as we believe that prior knowledge in TC is not so useful when there is a sufficient amount of training documents, we experimented with our model in poor training conditions (e.g. at most 20 documents per category). The improvements in accuracy, observed on the classification of the well-known Reuters and 20 NewsGroups corpora, show that our document similarity model is very promising for general IR tasks: unlike previous attempts, it justifies the adoption of external semantic resources (i.e. WN) in IR.
Section 2 introduces the WordNet-based term
similarity. Section 3 defines the new document simi-
larity measure, the kernel function and its use within
SVMs. Section 4 presents the comparative results
between the traditional linear and the WN-based
kernels within SVMs. Section 5 presents a comparative discussion of the related IR literature. Finally, Section 6 draws the conclusions.
2 Term similarity based on general
knowledge
In IR, any similarity metric in the vector space mod-
els is driven by lexical matching. When small train-
ing material is available, few words can be effec-
tively used and the resulting document similarity
metrics may be inaccurate. Semantic generaliza-
tions overcome data sparseness problems as con-
tributions from different but semantically similar
words are made available.
Methods for the induction of semantically in-
spired word clusters have been widely used in lan-
guage modeling and lexical acquisition tasks (e.g.
(Clark and Weir, 2002)). The resource employed
in most works is WordNet (Fellbaum, 1998) which
contains three subhierarchies: for nouns, verbs and
adjectives. Each hierarchy represents lexicalized
concepts (or senses) organized according to an "is-a-kind-of" relation. A concept $s$ is described by a set of words $syn(s)$ called a synset. The words $w \in syn(s)$ are synonyms according to the sense $s$.
For example, the words line, argumentation, logi-
cal argument and line of reasoning describe a synset
which expresses the methodical process of logical
reasoning (e.g. "I can't follow your line of reasoning"). Each word/term may be lexically related to
more than one synset depending on its senses. The
word line is also a member of the synset line, divid-
ing line, demarcation and contrast, as a line denotes
also a conceptual separation (e.g. "there is a narrow line between sanity and insanity"). The WordNet noun hierarchy is a directed acyclic graph¹ in which the edges establish the direct isa relations between two synsets.
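As an illustration of the synset structure just described, the following minimal sketch inspects the noun senses of the word line. NLTK's WordNet interface is assumed here purely for illustration; the paper does not prescribe any specific API.

```python
# A minimal sketch (NLTK assumed) of the synset structure described above.
from nltk.corpus import wordnet as wn

# Every noun synset the ambiguous word "line" belongs to, with its synonyms.
for s in wn.synsets("line", pos=wn.NOUN):
    print(s.name(), "->", ", ".join(l.name() for l in s.lemmas()))

# The direct is-a (hypernym) edges of one sense: the DAG structure of the noun hierarchy.
first_sense = wn.synsets("line", pos=wn.NOUN)[0]
print(first_sense.hypernyms())
```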
2.1 The Conceptual Density
The automatic use of WordNet for NLP and IR tasks
has proved to be very complex. First, how the topo-
logical distance among senses is related to their cor-
responding conceptual distance is unclear. The per-
vasive lexical ambiguity is also problematic as it im-
pacts on the measure of conceptual distances be-
tween word pairs. Second, the approximation of a
set of concepts by means of their generalization in
the hierarchy implies a conceptual loss that affects
the target IR (or NLP) tasks. For example, black
and white are colors but are also chess pieces and
this impacts on the similarity score that should be
used in IR applications. Methods to solve the above
problems attempt to map terms a priori to specific generalization levels, i.e. to cuts in the hierarchy (e.g. (Li and Abe, 1998; Resnik, 1997)), and
use corpus statistics for weighting the resulting map-
pings. For several tasks (e.g. in TC) this is unsatis-
factory: different contexts of the same corpus (e.g.
documents) may require different generalizations of
the same word as they independently impact on the
document similarity.
On the contrary, the Conceptual Density (CD)
(Agirre and Rigau, 1996) is a flexible semantic similarity measure which depends on the generalizations of word senses without referring to any fixed level of the hierarchy. The CD defines a metric according to the topological structure of WordNet and can be seamlessly applied to two or more words. The measure formalized hereafter adapts to word pairs a more general definition given in (Basili et al., 2004).
We denote by $\bar{s}$ the set of nodes of the hierarchy rooted in the synset $s$, i.e. $\{c \in S \mid c \ \text{isa} \ s\}$, where $S$ is the set of WN synsets. By definition, $\forall s \in S, s \in \bar{s}$. CD makes a guess about the proximity of the senses, $s_1$ and $s_2$, of two words $u_1$ and $u_2$, according to the information expressed by the minimal sub-hierarchy, $\bar{s}$, that includes them. Let $S_i$ be the set of
¹ As only 1% of its nodes have more than one parent in the graph, most of the techniques assume the hierarchy to be a tree, and treat the few exceptions heuristically.
generalizations for at least one sense $s_i$ of the word $u_i$, i.e. $S_i = \{s \in S \mid s_i \in \bar{s},\ u_i \in syn(s_i)\}$. The CD of $u_1$ and $u_2$ is:

$$CD(u_1,u_2) = \begin{cases} 0 & \text{iff } S_1 \cap S_2 = \emptyset \\[4pt] \displaystyle\max_{s \in S_1 \cap S_2} \frac{\sum_{i=0}^{h} \mu(\bar{s})^i}{|\bar{s}|} & \text{otherwise} \end{cases} \quad (1)$$
where:
• $S_1 \cap S_2$ is the set of WN shared generalizations (i.e. the common hypernyms) of $u_1$ and $u_2$
• $\mu(\bar{s})$ is the average number of children per node (i.e. the branching factor) in the sub-hierarchy $\bar{s}$. $\mu(\bar{s})$ depends on WordNet and in some cases its value can approach 1.
• $h$ is the depth of the ideal, i.e. maximally dense, tree with enough leaves to cover the two senses, $s_1$ and $s_2$, according to an average branching factor of $\mu(\bar{s})$. This value is actually estimated by:

$$h = \begin{cases} \lfloor \log_{\mu(\bar{s})} 2 \rfloor & \text{iff } \mu(\bar{s}) \neq 1 \\ 2 & \text{otherwise} \end{cases} \quad (2)$$

When $\mu(\bar{s})=1$, $h$ ensures a tree with at least 2 nodes to cover $s_1$ and $s_2$ (height = 2).
• $|\bar{s}|$ is the number of nodes in the sub-hierarchy $\bar{s}$. This value is statically measured on WN and acts as a negative bias for higher-level generalizations (i.e. larger $\bar{s}$).
CD models the semantic distance as the density of the generalizations $s \in S_1 \cap S_2$. Such density is the ratio between the number of nodes of the ideal tree and $|\bar{s}|$. The ideal tree should (a) link the two senses/nodes $s_1$ and $s_2$ with the minimal number of edges (isa-relations) and (b) maintain the same branching factor (bf) observed in $\bar{s}$. In other words, this tree provides the minimal number of nodes (and isa-relations) sufficient to connect $s_1$ and $s_2$ according to the topological structure of $\bar{s}$. For example, if $\bar{s}$ has a bf of 2, the ideal tree connects the two senses with a single node (their father). If the bf is 1.5, to replicate it, the ideal tree must contain 4 nodes, i.e. the grandfather, which has a bf of 1, and the father, which has a bf of 2, for an average of 1.5. When bf is 1, Eq. 1 degenerates to the inverse of the number of nodes in the path between $s_1$ and $s_2$, i.e. the simple proximity measure used in (Siolas and d'Alch Buc, 2000).
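The following is a minimal, unoptimized sketch of Eq. 1 and Eq. 2, not the authors' implementation. NLTK is assumed as the WordNet interface, $\mu(\bar{s})$ is computed as the average number of children over the internal nodes of $\bar{s}$ (consistently with the branching-factor examples above), and the handling of a shared leaf sense is a guard the paper does not specify.

```python
# A sketch of Conceptual Density (Eq. 1 and Eq. 2) over the WN noun hierarchy.
import math
from nltk.corpus import wordnet as wn

def sub_hierarchy(s):
    """The set s-bar: the synset s plus all synsets below it (c isa s)."""
    return {s} | set(s.closure(lambda x: x.hyponyms()))

def generalizations(word):
    """S_i: every synset that dominates at least one noun sense of the word."""
    gens = set()
    for sense in wn.synsets(word, pos=wn.NOUN):
        gens.add(sense)
        gens |= set(sense.closure(lambda x: x.hypernyms()))
    return gens

def conceptual_density(u1, u2):
    common = generalizations(u1) & generalizations(u2)
    if not common:
        return 0.0                                   # Eq. 1, empty intersection
    best = 0.0
    for s in common:
        nodes = sub_hierarchy(s)
        internal = [n for n in nodes if n.hyponyms()]
        if internal:
            mu = sum(len(n.hyponyms()) for n in internal) / len(internal)  # branching factor
        else:
            mu = 1.0        # guard for a shared leaf sense (case not covered by Eq. 2)
        h = 2 if mu == 1 else math.floor(math.log(2, mu))                  # Eq. 2
        density = sum(mu ** i for i in range(h + 1)) / len(nodes)          # Eq. 1
        best = max(best, density)
    return best

print(conceptual_density("line", "argument"))
```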
It is worth noting that, for each pair, $CD(u_1,u_2)$ determines the similarity according to the closest lexical senses, $s_1, s_2 \in \bar{s}$: the remaining senses of $u_1$ and $u_2$ are irrelevant, with a resulting semantic disambiguation side effect. CD has been successfully applied to semantic tagging (Basili et al., 2004).
As the WN hierarchies for other POS classes (i.e. verbs and adjectives) have topological properties different from the noun hyponymy network, their semantics is not suitably captured by Eq. 1. In this paper, Eq. 1 has thus been applied only to noun pairs. As the high number of such pairs increases the computational complexity of the target learning algorithm, efficient approaches are needed. The next section describes how kernel methods can make the use of Conceptual Density practical for Text Categorization.
3 A WordNet Kernel for document
similarity
Term similarities are used to design document simi-
larities which are the core functions of most TC al-
gorithms. The term similarity proposed in Eq. 1
is valid for all term pairs of a target vocabulary and
has two main advantages: (1) the relatedness of each
term occurring in the first document can be com-
puted against all terms in the second document, i.e.
all different pairs of similar (not just identical) to-
kens can contribute and (2) if we use all term pair
contributions in the document similarity we obtain a
measure consistent with the term probability distri-
butions, i.e. the sum of all term contributions does not arbitrarily penalize or emphasize any subset of terms. The next section presents the above idea more formally.
3.1 A semantic vector space
Given two documents $d_1$ and $d_2 \in D$ (the document set) we define their similarity as:

$$K(d_1,d_2) = \sum_{w_1 \in d_1,\ w_2 \in d_2} (\lambda_1\lambda_2) \times \sigma(w_1,w_2) \quad (3)$$

where $\lambda_1$ and $\lambda_2$ are the weights of the words (features) $w_1$ and $w_2$ in the documents $d_1$ and $d_2$, respectively, and $\sigma$ is a term similarity function, e.g. the conceptual density defined in Section 2. To prove that Eq. 3 is a valid kernel, it is enough to show that it is a specialization of the general definition of convolution kernels formalized in (Haussler, 1999). Hereafter, we report such a definition. Let $X, X_1, \dots, X_m$ be separable metric spaces, $x \in X$ a structure and $\vec{x} = x_1, \dots, x_m$ its parts, where $x_i \in X_i \ \forall i = 1, \dots, m$. Let $R$ be a relation on the set $X \times X_1 \times \dots \times X_m$ such that $R(\vec{x},x)$ is "true" if $\vec{x}$ are the parts of $x$. We indicate with $R^{-1}(x)$ the set $\{\vec{x} : R(\vec{x},x)\}$. Given two objects $x$ and $y \in X$, their similarity $K(x,y)$ is defined as:

$$K(x,y) = \sum_{\vec{x} \in R^{-1}(x)} \ \sum_{\vec{y} \in R^{-1}(y)} \ \prod_{i=1}^{m} K_i(x_i,y_i) \quad (4)$$

If $X$ defines the document set (i.e. $D = X$) and $X_1$ the vocabulary of the target document corpus ($X_1 = V$), it follows that: $x = d$ (a document), $\vec{x} = x_1 = w \in V$ (a word which is a part of the document $d$) and $R^{-1}(d)$ defines the set of words in the document $d$. As $\prod_{i=1}^{m} K_i(x_i,y_i) = K_1(x_1,y_1)$, then $K_1(x_1,y_1) = K(w_1,w_2) = (\lambda_1\lambda_2) \times \sigma(w_1,w_2)$, i.e. Eq. 3.
The above equation can be used in support vector
machines as illustrated by the next section.
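A minimal sketch of Eq. 3 follows; the authors implemented it inside SVM-light, so this stand-alone version is only illustrative. Documents are dicts mapping words to their weights λ, and σ is any term similarity, e.g. the conceptual_density sketch above; with exact string matching as σ, Eq. 3 reduces to the ordinary bag-of-words dot product.

```python
# A sketch of the document kernel in Eq. 3 (not the authors' SVM-light code).
def semantic_kernel(doc1, doc2, sigma):
    """doc1, doc2: dicts word -> weight (lambda); sigma(w1, w2) -> term similarity."""
    return sum(l1 * l2 * sigma(w1, w2)
               for w1, l1 in doc1.items()
               for w2, l2 in doc2.items())

# Toy usage: string match as sigma gives the plain bag-of-words dot product.
bow_sigma = lambda a, b: 1.0 if a == b else 0.0
d1 = {"market": 1.0, "wheat": 2.0}
d2 = {"market": 1.0, "grain": 1.0}
print(semantic_kernel(d1, d2, bow_sigma))            # 1.0
# semantic_kernel(d1, d2, conceptual_density) would give the WN-based score.
```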
3.2 Support Vector Machines and Kernel
methods
Given the vector space in $\mathbb{R}^\eta$ and a set of positive and negative points, SVMs classify vectors according to a separating hyperplane, $H(\vec{x}) = \vec{\omega} \cdot \vec{x} + b = 0$, where $\vec{x}$ and $\vec{\omega} \in \mathbb{R}^\eta$ and $b \in \mathbb{R}$ are learned by applying the Structural Risk Minimization principle (Vapnik, 1995). From kernel theory we have that:

$$H(\vec{x}) = \Big(\sum_{h=1..l} \alpha_h \vec{x}_h\Big) \cdot \vec{x} + b = \sum_{h=1..l} \alpha_h \vec{x}_h \cdot \vec{x} + b = \sum_{h=1..l} \alpha_h \phi(d_h) \cdot \phi(d) + b = \sum_{h=1..l} \alpha_h K(d_h,d) + b \quad (5)$$

where $d$ is a document to be classified and $d_h$ are all the $l$ training instances, projected into $\vec{x}$ and $\vec{x}_h$ respectively. The product $K(d,d_h) = \langle \phi(d) \cdot \phi(d_h) \rangle$ is the Semantic WN-based Kernel (SK) function associated with the mapping $\phi$.
Eq. 5 shows that to evaluate the separating hyperplane in $\mathbb{R}^\eta$ we do not need to evaluate the entire vector $\vec{x}_h$ or $\vec{x}$. Actually, we do not even know the mapping $\phi$ and the number of dimensions, $\eta$. As it is sufficient to compute $K(d,d_h)$, we can carry out the learning with Eq. 3 in $\mathbb{R}^n$, avoiding
the use of the explicit representation in the $\mathbb{R}^\eta$ space. The real advantage is that we can consider only the word pairs associated with non-zero weights, i.e. we can use a sparse vector computation. Additionally, to obtain a uniform score across different document sizes, the kernel function can be normalized as follows:

$$\frac{SK(d_1,d_2)}{\sqrt{SK(d_1,d_1) \cdot SK(d_2,d_2)}}$$
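A sketch of this normalization and of learning from a precomputed Gram matrix is given below; scikit-learn is assumed here only as a stand-in for the modified SVM-light used in the paper, and semantic_kernel / conceptual_density refer to the earlier sketches.

```python
# A sketch of the normalized SK and its use with a precomputed kernel matrix.
import numpy as np
from sklearn.svm import SVC

def gram_matrix(docs_a, docs_b, kernel):
    """K[i, j] = kernel(a_i, b_j) / sqrt(kernel(a_i, a_i) * kernel(b_j, b_j))."""
    na = np.array([kernel(d, d) for d in docs_a])
    nb = np.array([kernel(d, d) for d in docs_b])
    raw = np.array([[kernel(a, b) for b in docs_b] for a in docs_a])
    return raw / np.sqrt(np.outer(na, nb))

# Illustrative usage (names are assumptions, not the paper's code):
# sk = lambda a, b: semantic_kernel(a, b, conceptual_density)
# clf = SVC(kernel="precomputed", C=0.1)   # C plays the role of SVM-light's c option
# clf.fit(gram_matrix(train_docs, train_docs, sk), y_train)
# pred = clf.predict(gram_matrix(test_docs, train_docs, sk))
```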
4 Experiments
The use of WordNet (WN) in the term similarity function introduces prior knowledge whose impact on the Semantic Kernel (SK) should be experimentally assessed. The main goal is to compare the traditional Vector Space Model kernel against SK, both within the Support Vector learning algorithm.
The high complexity of the SK limits the size of the experiments that we can carry out in a feasible time. Moreover, we are not interested in large collections of training documents, as in those training conditions the simple bag-of-words models are in general very effective, i.e. they seem to model well the document similarity needed by the learning algorithms. Thus, we carried out the experiments on small subsets of the 20NewsGroups² (20NG) and the Reuters-21578³ corpora to simulate critical learning conditions.
4.1 Experimental set-up
For the experiments, we used the SVM-light software (Joachims, 1999) (available at svmlight.joachims.org) with the default linear kernel on the token space (adopted as the baseline evaluation). For the SK evaluation we implemented Eq. 3 with $\sigma(\cdot,\cdot) = CD(\cdot,\cdot)$ (Eq. 1) inside SVM-light. As Eq. 1 is only defined for nouns, a part-of-speech (POS) tagger was applied beforehand. However, verbs, adjectives and numerical features were also included in the pair space. For these tokens, CD = 0 is assigned to pairs made of different strings. As the POS-tagger could introduce errors, in a second experiment, any token with a successful look-up in the WN noun hierarchy was considered in the kernel. This approximation has the benefit of retrieving useful information even
² Available at www.ai.mit.edu/people/jrennie/20Newsgroups/.
³ The Apté split available at kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
for verbs and of capturing the similarity between verbs and some nouns, e.g. to drive (via the noun drive) has a common synset with parkway.
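The look-up approximation can be sketched as follows, again assuming NLTK as the WN interface; the token list and the final check on drive/parkway are only illustrative.

```python
# A sketch of the noun-hierarchy look-up approximation described above.
from nltk.corpus import wordnet as wn

def has_noun_reading(token):
    """Keep a token if it has at least one entry in the WN noun hierarchy."""
    return bool(wn.synsets(token, pos=wn.NOUN))

for tok in ["drive", "parkway", "quickly"]:
    print(tok, has_noun_reading(tok))

# Inspect the noun synsets shared by "drive" and "parkway" (the example above).
shared = set(wn.synsets("drive", pos=wn.NOUN)) & set(wn.synsets("parkway", pos=wn.NOUN))
print(shared)
```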
For the evaluations, we applied a careful SVM parameterization: a preliminary investigation suggested that the trade-off parameter (between the training-set error and the margin, i.e. the c option in SVM-light) optimizes the F1 measure for values in the range [0.02, 0.32]⁴. We also noted that the cost-factor parameter (i.e. the j option) is not critical, i.e. a value of 10 always optimizes the accuracy. Feature selection techniques and weighting schemes were not applied in our experiments, as they cannot be accurately estimated from the small amount of available training data.
The classification performance was evaluated by means of the F1 measure⁵ for single categories and the MicroAverage for the final classifier pool (Yang, 1999). Given the high computational complexity of SK, we selected 8 categories from the 20NG⁶ and 8 from the Reuters corpus⁷.
To derive statistically significant results with few
training documents, for each corpus, we randomly
selected 10 different samples from the 8 categories.
We trained the classifiers on one sample, parameter-
ized on a second sample and derived the measures
on the other 8. By rotating the training sample we
obtained 80 different measures for each model. The
size of the samples ranges from 24 to 160 documents
depending on the target experiment.
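A sketch of this rotation protocol is given below; the choice of the next sample in the rotation as the parameterization sample is an assumption, since the paper only says that a second sample was used.

```python
# A sketch of the sampling protocol: 10 disjoint samples, train on one, tune on
# a second, test on the remaining 8, and rotate the training sample (80 measures).
def rotate_evaluation(samples, train_and_score):
    """samples: list of 10 document samples; train_and_score(train, dev, test) -> F1."""
    n = len(samples)
    scores = []
    for i in range(n):
        dev_idx = (i + 1) % n                       # assumed choice of the tuning sample
        train, dev = samples[i], samples[dev_idx]
        tests = [s for k, s in enumerate(samples) if k not in (i, dev_idx)]
        scores.extend(train_and_score(train, dev, t) for t in tests)
    return scores                                   # 10 x 8 = 80 values when n == 10
```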
4.2 Cross validation results
The SK (Eq. 3) was compared with the linear kernel
which obtained the best F1 measure in (Joachims,
1999). Table 1 reports the first comparative results
for 8 categories of 20NG on 40 training documents.
The results are expressed as the Mean and the Std.
Dev. over 80 runs. The F1 values are reported in Column 2
for the linear kernel, i.e. bow, in Column 3 for SK
without applying POS information and in Column 4
⁴ We used all the values from 0.02 to 0.32 with step 0.02.
⁵ F1 assigns equal importance to Precision $P$ and Recall $R$, i.e. $F_1 = \frac{2 P \cdot R}{P + R}$.
⁶ We selected the 8 most different categories (in terms of their content), i.e. Atheism, Computer Graphics, Misc Forsale, Autos, Sport Baseball, Medicine, Talk Religions and Talk Politics.
⁷ We selected the 8 largest categories, i.e. Acquisition, Earn, Crude, Grain, Interest, Money-fx, Trade and Wheat.
for SK with the use of POS information (SK-POS).
The last row shows the MicroAverage performance
for the above three models on all 8 categories. We
note that SK improves over bow by about 3 absolute points, i.e. 34.3% vs. 31.5%, and that the POS information reduces the improvement of SK, i.e. 33.5% vs. 34.3%.
To verify the hypothesis that WN information is
useful in low training data conditions we repeated
the evaluation over the 8 categories of Reuters with
samples of 24 and 160 documents, respectively. The
results reported in Table 2 show that (1) SK again improves over bow (41.7% - 37.2% = 4.5%) and (2) as the number of documents increases, the improvement decreases (77.9% - 75.9% = 2%). It is worth noting that the standard deviations tend to assume high values. In general, the use of 10 disjoint training/testing samples produces a higher variability than n-fold cross validation, which insists on the same document set. However, this does not affect the Student's t-test over the differences between the MicroAverage of SK and bow: the former has a higher accuracy at a 99% confidence level.
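A sketch of such a significance check is shown below; the paper only names the test, so the paired, one-sided formulation and the use of scipy are assumptions.

```python
# A sketch of the significance check over the 80 per-run MicroAverage F1 values.
from scipy import stats

def sk_better_than_bow(sk_scores, bow_scores, alpha=0.01):
    """Paired, one-sided t-test: is SK better than bow at the (1 - alpha) level?"""
    t, p_two_sided = stats.ttest_rel(sk_scores, bow_scores)   # paired t-test
    return t > 0 and (p_two_sided / 2) < alpha                # e.g. 99% confidence
```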
The above findings confirm that SK outperforms
the bag-of-words kernel in critical learning condi-
tions as the semantic contribution of the SK recov-
ers useful information. To complete this study we
carried out experiments with samples of different
size, i.e. 3, 5, 10, 15 and 20 documents for each
category. Figures 1 and 2 show the learning curves
for 20NG and Reuters corpora. Each point refers to
the average on 80 samples.
As expected, the improvement provided by SK decreases when more training data is available. However, the improvements are still not negligible.
The SK model (without POS information) pre-
serves about 2-3% of improvement with 160 training
documents. The matching allowed between noun-
verb pairs still captures semantic information which
is useful for topic detection. In particular, during
the similarity estimation, each word activates 60.05
pairs on average. This is particularly useful to in-
crease the amount of information available to the
SVMs.
Finally, we carried out some experiments with
160 Reuters documents by discarding the string
matching from SK. Only words having different surface forms were allowed to contribute to Eq. 3.
Category bow SK SK-POS
Atheism 29.5±19.8 32.0±16.3 25.2±17.2
Comp.Graph 39.2±20.7 39.3±20.8 29.3±21.8
Misc.Forsale 61.3±17.7 51.3±18.7 49.5±20.4
Autos 26.2±22.7 26.0±20.6 33.5±26.8
Sport.Baseb. 32.7±20.1 36.9±22.5 41.8±19.2
Sci.Med 26.1±17.2 18.5±17.4 16.6±17.2
Talk.Relig. 23.5±11.6 28.4±19.0 27.6±17.0
Talk.Polit. 28.3±17.5 30.7±15.5 30.3±14.3
MicroAvg. F1 31.5±4.8 34.3±5.8 33.5±6.4
Table 1: Performance of the linear and Semantic Kernel with
40 training documents over 8 categories of 20NewsGroups col-
lection.
Category 24 docs 160 docs
bow SK bow SK
Acq. 55.3±18.1 50.8±18.1 86.7±4.6 84.2±4.3
Crude 3.4±5.6 3.5±5.7 64.0±20.6 62.0±16.7
Earn 64.0±10.0 64.7±10.3 91.3±5.5 90.4±5.1
Grain 45.0±33.4 44.4±29.6 69.9±16.3 73.7±14.8
Interest 23.9±29.9 24.9±28.6 67.2±12.9 59.8±12.6
Money-fx 36.1±34.3 39.2±29.5 69.1±11.9 67.4±13.3
Trade 9.8±21.2 10.3±17.9 57.1±23.8 60.1±15.4
Wheat 8.6±19.7 13.3±26.3 23.9±24.8 31.2±23.0
Mic.Avg. 37.2±5.9 41.7±6.0 75.9±11.0 77.9±5.7
Table 2: Performance of the linear and Semantic Kernel with 24 and 160 training documents over 8 categories of the Reuters corpus.
[Figure 1: learning curves (x-axis: # Training Documents, 40–160; y-axis: MicroAverage F1; curves: bow, SK, SK-POS)]
Figure 1: MicroAverage F1 of SVMs using bow, SK and SK-POS kernels over the 8 categories of 20NewsGroups.
The important outcome is that SK converges to a
MicroAverage F1 measure of 56.4% (compare with
Table 2). This shows that the word similarity provided by WN is still consistent and, even in this worst case, somewhat effective for TC: the evidence is that a suitable balance between lexical ambiguity and topical relatedness is captured by the SVM learning.
[Figure 2: learning curves (x-axis: # Training Documents, 20–160; y-axis: MicroAverage F1; curves: bow, SK)]
Figure 2: MicroAverage F1 of SVMs using bow and SK over the 8 categories of the Reuters corpus.
5 Related Work
The IR studies in this area focus on term similarity models that embed statistical and external knowledge in document similarity.
In (Kontostathis and Pottenger, 2002) a Latent Se-
mantic Indexing analysis was used for term cluster-
ing. Such an approach assumes that a positive value $x_{ij}$ in the transformed term-term matrix represents the similarity between terms $i$ and $j$, while a negative value represents an anti-similarity, enabling both positive and negative clusters of terms. Evaluation of query expansion techniques showed that positive clusters can improve Recall by about 18% for the CISI collection, 2.9% for MED and 3.4% for CRAN. Furthermore,
the negative clusters, when used to prune the result
set, improve the precision.
The use of external semantic knowledge seems
to be more problematic in IR. In (Smeaton, 1999),
the impact of semantic ambiguity on IR is stud-
ied. A WN-based semantic similarity function be-
tween noun pairs is used to improve indexing and
document-query matching. However, the WSD algorithm had an accuracy ranging between 60 and 70%, and this made the overall semantic similarity ineffective.
Other studies using semantic information for im-
proving IR were carried out in (Sussna, 1993) and
(Voorhees, 1993; Voorhees, 1994). Word seman-
tic information was here used for text indexing and
query expansion, respectively. In (Voorhees, 1994)
it is shown that semantic information derived di-
rectly from WN without a priori WSD produces
poor results.
The latter methods are even more problematic in
TC (Moschitti and Basili, 2004). Word senses tend
to systematically correlate with the positive exam-
ples of a category. Different categories are better
characterized by different words rather than differ-
ent senses. Patterns of lexical co-occurrences in the
training data seem to suffice for automatic disam-
biguation. (Scott and Matwin, 1999) used WN senses to replace simple words without word sense disambiguation, and small improvements were obtained only for a small corpus. The scale and assessment pro-
vided in (Moschitti and Basili, 2004) (3 corpora us-
ing cross-validation techniques) showed that even
the accurate disambiguation of WN senses (about
80% accuracy on nouns) did not improve TC.
An approach similar to the one presented in this article was proposed in (Siolas and d'Alch Buc, 2000). A term proximity function is used to design a kernel able to semantically smooth the similarity between two document terms. Such a semantic kernel was designed as a combination of the Radial Basis Function (RBF) kernel with the term proximity
matrix. Entries in this matrix are inversely propor-
tional to the length of the WN hierarchy path link-
ing the two terms. The performance, measured over
the 20NewsGroups corpus, showed an improvement
of 2% over the bag-of-words. Three main differ-
ences exist with respect to our approach. First, the
term proximity does not fully capture the WN topo-
logical information. Equidistant terms receive the
same similarity irrespectively from their generaliza-
tion level. For example, Sky and Location (direct
hyponyms of Entity) receive a similarity score equal
to knife and gun (hyponyms of weapon). More ac-
curate measures have been widely discussed in lit-
erature, e.g. (Resnik, 1997). Second, the kernel-
based CD similarity is an elegant combination of
lexicalized and semantic information. In (Siolas and
d’Alch Buc, 2000) the combination of weighting
schemes, the RBF kernel and the proximity matrix has a much less clear interpretation. Finally, (Siolas and d'Alch Buc, 2000) selected only 200 features via Mutual Information statistics. In this way, rare or non-statistically-significant terms are neglected, even though they are often the source of relevant contributions in the SK space modeled over WN.
Other important work on semantic kernels for retrieval has been developed in (Cristianini et al.,
2002; Kandola et al., 2002). Two methods for in-
ferring semantic similarity from a corpus were pro-
posed. In the first, a system of equations was derived from the dual relation between word similarity based on document similarity and vice versa. The
equilibrium point was used to derive the semantic
similarity measure. The second method models se-
mantic relations by means of a diffusion process on
a graph defined by lexicon and co-occurrence in-
formation. The major difference with our approach
is the use of a different source of prior knowledge.
Similar techniques were also applied in (Hofmann,
2000) to derive a Fisher kernel based on a latent class
decomposition of the term-document matrix.
6 Conclusions
The introduction of semantic prior knowledge in
IR has always been an interesting subject as the
examined literature suggests. In this paper, we
used the conceptual density function on the Word-
Net (WN) hierarchy to define a document similar-
ity metric. Accordingly, we defined a semantic
kernel to train Support Vector Machine classifiers.
Cross-validation experiments over 8 categories of
20NewsGroups and Reuters over multiple samples
have shown that in poor training data conditions, the
WN prior knowledge can be effectively used to im-
prove (up to 4.5 absolute percent points, i.e. 10%)
the TC accuracy.
These promising results suggest a number of future research directions: (1) larger-scale experiments with different measures and semantic similarity models (e.g. (Resnik, 1997)); (2) improvement of the overall efficiency by exploring feature selection methods over the SK; and (3) the extension of the semantic similarity by a general (i.e. non-binary) application of the conceptual density model.

References
E. Agirre and G. Rigau. 1996. Word sense disambiguation
using conceptual density. In Proceedings of COLING’96,
Copenhagen, Denmark.
R. Basili, M. Cammisa, and F. M. Zanzotto. 2004. A similar-
ity measure for unsupervised semantic disambiguation. In Proceedings of the Language Resources and Evaluation Conference, Lisbon, Portugal.
Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Win-
ter. 2001. On feature distributional clustering for text cat-
egorization. In Proceedings of SIGIR'01, New Orleans,
Louisiana, US.
Stephen Clark and David Weir. 2002. Class-based probability
estimation using a semantic hierarchy. Comput. Linguist.,
28(2):187–206.
Nello Cristianini, John Shawe-Taylor, and Huma Lodhi. 2002.
Latent semantic kernels. J. Intell. Inf. Syst., 18(2-3):127–
152.
Christiane Fellbaum. 1998. WordNet: An Electronic Lexical
Database. MIT Press.
D. Haussler. 1999. Convolution kernels on discrete struc-
tures. Technical report ucs-crl-99-10, University of Califor-
nia Santa Cruz.
Thomas Hofmann. 2000. Learning probabilistic models of
the web. In Research and Development in Information Re-
trieval.
T. Joachims. 1999. Making large-scale SVM learning practical.
In B. Schölkopf, C. Burges, and A. Smola, editors, Advances
in Kernel Methods - Support Vector Learning.
J. Kandola, J. Shawe-Taylor, and N. Cristianini. 2002. Learn-
ing semantic similarity. In NIPS'02. MIT Press.
A. Kontostathis and W. Pottenger. 2002. Improving retrieval
performance with positive and negative equivalence classes
of terms.
Hang Li and Naoki Abe. 1998. Generalizing case frames using
a thesaurus and the mdl principle. Computational Linguis-
tics, 23(3).
Alessandro Moschitti and Roberto Basili. 2004. Complex
linguistic features for text classification: a comprehensive
study. In Proceedings of ECIR’04, Sunderland, UK.
P. Resnik. 1997. Selectional preference and sense disambigua-
tion. In Proceedings of ACL Siglex Workshop on Tagging
Text with Lexical Semantics, Why, What and How?, Wash-
ington, 1997.
Sam Scott and Stan Matwin. 1999. Feature engineering for
text classification. In Proceedings of ICML’99, Bled, SL.
Morgan Kaufmann Publishers, San Francisco, US.
Georges Siolas and Florence d’Alch Buc. 2000. Support vector
machines based on a semantic kernel for text categorization.
In Proceedings of IJCNN’00. IEEE Computer Society.
Alan F. Smeaton. 1999. Using NLP or NLP resources for in-
formation retrieval tasks. In Natural language information
retrieval, Kluwer Academic Publishers, Dordrecht, NL.
M. Sussna. 1993. Word sense disambiguation for free-text in-
dexing using a massive semantic network. In CIKM'93.
V. Vapnik. 1995. The Nature of Statistical Learning Theory.
Springer.
Ellen M. Voorhees. 1993. Using wordnet to disambiguate word
senses for text retrieval. In Proceedings of SIGIR'93, Pittsburgh, PA, USA.
Ellen M. Voorhees. 1994. Query expansion using
lexical-semantic relations. In Proceedings of SIGIR’94,
ACM/Springer.
Y. Yang. 1999. An evaluation of statistical approaches to text
categorization. Information Retrieval Journal.
