Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),
pages 56–63, Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Domain Kernels for Text Categorization
Alfio Gliozzo and Carlo Strapparava
ITC-Irst
via Sommarive, I-38050, Trento, ITALY
{gliozzo,strappa}@itc.it
Abstract
In this paper we propose and evaluate a technique to perform semi-supervised learning for Text Categorization. In particular we defined a kernel function, namely the Domain Kernel, that allowed us to plug external knowledge into the supervised learning process. External knowledge is acquired from unlabeled data in a totally unsupervised way, and it is represented by means of Domain Models.
We evaluated the Domain Kernel in two standard benchmarks for Text Categorization with good results, and we compared its performance with a kernel function that exploits a standard bag-of-words feature representation. The learning curves show that the Domain Kernel allows us to drastically reduce the amount of training data required for learning.
1 Introduction
Text Categorization (TC) deals with the problem of assigning a set of category labels to documents. Categories are usually defined according to a variety of topics (e.g. SPORT vs. POLITICS) and a set of hand-tagged examples is provided for training. In state-of-the-art TC settings, supervised classifiers are used for learning and texts are represented by means of bag-of-words.
Even if, in principle, supervised approaches reach the best performance in many Natural Language Processing (NLP) tasks, in practice it is not always easy to apply them to concrete application settings. In fact, supervised systems for TC need to be trained on a large amount of hand-tagged texts. This situation is usually feasible only when there is someone (e.g. a big company) that can easily provide already classified documents to train the system.
In most cases this scenario is quite impractical, if not infeasible. An example is the task of categorizing personal documents, in which the categories can be modified according to the user's interests: new categories are often introduced and, possibly, the available labeled training data for them is very limited.
In the NLP literature the problem of providing large amounts of manually annotated data is known as the Knowledge Acquisition Bottleneck. Current research in supervised approaches to NLP often deals with defining methodologies and algorithms to reduce the amount of human effort required for collecting labeled examples.
A promising direction to solve this problem is to provide unlabeled data together with labeled texts to help supervision. In the Machine Learning literature this learning schema has been called semi-supervised learning. It has been applied to the TC problem using different techniques: co-training (Blum and Mitchell, 1998), the EM algorithm (Nigam et al., 2000), Transductive SVM (Joachims, 1999b) and Latent Semantic Indexing (Zelikovitz and Hirsh, 2001).
In this paper we propose a novel technique to perform semi-supervised learning for TC. The underlying idea behind our approach is that lexical coherence (i.e. co-occurrence in texts of semantically related terms) (Magnini et al., 2002) is an inherent property of corpora, and it can be exploited to help a supervised classifier to build a better categorization hypothesis, even if the amount of labeled training data provided for learning is very low.
Our proposal consists of defining a Domain Kernel and exploiting it inside a Support Vector Machine (SVM) classification framework for TC (Joachims, 2002). The Domain Kernel relies on the notion of Domain Model, which is a shallow representation for lexical ambiguity and variability. Domain Models can be acquired in an unsupervised way from unlabeled data, and then exploited to define a Domain Kernel (i.e. a generalized similarity function among documents)¹.
We evaluated the Domain Kernel in two standard benchmarks for TC (i.e. Reuters and 20Newsgroups), and we compared its performance with a kernel function that exploits a more standard Bag-of-Words (BoW) feature representation. The use of the Domain Kernel yielded a significant improvement in the learning curves of both tasks. In particular, there is a notable increase in recall, especially with few learning examples. In addition, the F1 measure increases by 2.8 points in the Reuters task at full learning, achieving state-of-the-art results.
The paper is structured as follows. Section 2 introduces the notion of Domain Model and describes an automatic acquisition technique based on Latent Semantic Analysis (LSA). In Section 3 we illustrate the SVM approach to TC, and we define a Domain Kernel that exploits Domain Models to estimate similarity among documents. In Section 4 the performance of the Domain Kernel is compared with a standard bag-of-words feature representation, showing the improvements in the learning curves. Section 5 describes the previous attempts to exploit semi-supervised learning for TC, while Section 6 concludes the paper and proposes some directions for future research.
¹ The idea of exploiting a Domain Kernel to help a supervised classification framework has also been profitably used in other NLP tasks such as word sense disambiguation (see for example (Strapparava et al., 2004)).
2 Domain Models
The simplest methodology to estimate the similarity among the topics of two texts is to represent them by means of vectors in the Vector Space Model (VSM), and to exploit the cosine similarity. More formally, let $\mathcal{T} = \{t_1, t_2, \ldots, t_n\}$ be a corpus, let $V = \{w_1, w_2, \ldots, w_k\}$ be its vocabulary, and let $\mathbf{T}$ be the $k \times n$ term-by-document matrix representing $\mathcal{T}$, such that $t_{i,j}$ is the frequency of word $w_i$ in the text $t_j$. The VSM is a $k$-dimensional space $\mathbb{R}^k$, in which the text $t_j \in \mathcal{T}$ is represented by means of the vector $\vec{t}_j$ such that the $i$-th component of $\vec{t}_j$ is $t_{i,j}$. The similarity between two texts in the VSM is estimated by computing the cosine of the angle between their vectors.
However, this approach does not deal well with lexical variability and ambiguity. For example the two sentences “he is affected by AIDS” and “HIV is a virus” do not have any content words in common. In the VSM their similarity is zero because they have orthogonal vectors, even if the concepts they express are very closely related. On the other hand, the similarity between the two sentences “the laptop has been infected by a virus” and “HIV is a virus” would turn out very high, due to the ambiguity of the word virus.
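The behaviour described above can be reproduced with a minimal Python sketch. The whitespace tokenizer, the small closed-class word list `STOPWORDS`, and the helper names `bow` and `cosine` are illustrative assumptions and not part of the system evaluated in this paper; closed-class words are dropped so that only content words are compared.

```python
import math
from collections import Counter

# Hypothetical closed-class word list, just large enough for the three example sentences.
STOPWORDS = {"he", "is", "by", "a", "the", "has", "been"}

def bow(text):
    """Bag-of-words vector: content-word frequencies of a whitespace-tokenized text."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

s_aids = bow("he is affected by AIDS")
s_hiv = bow("HIV is a virus")
s_laptop = bow("the laptop has been infected by a virus")

print(cosine(s_aids, s_hiv))        # 0.0: no shared content word, orthogonal vectors
print(cosine(s_laptop, s_hiv) > 0)  # True: the ambiguous word "virus" is shared
```

On such short sentences a single shared word is enough to produce a non-zero score, even though the two texts belong to different topics.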
To overcome this problem we introduce the notion of Domain Model (DM), and we show how to use it in order to define a domain VSM, in which texts and terms are represented in a uniform way.
A Domain Model is composed of soft clusters of terms. Each cluster represents a semantic domain (Gliozzo et al., 2004), i.e. a set of terms that often co-occur in texts having similar topics. A Domain Model is represented by a $k \times k'$ rectangular matrix $\mathbf{D}$, containing the degree of association among terms and domains, as illustrated in Table 1.
          MEDICINE   COMPUTER SCIENCE
HIV       1          0
AIDS      1          0
virus     0.5        0.5
laptop    0          1
Table 1: Example of Domain Matrix
Domain Models can be used to describe lexical ambiguity and variability. Lexical ambiguity is represented by associating one term with more than one domain, while variability is represented by associating different terms with the same domain. For example the term virus is associated with both the domain COMPUTER SCIENCE and the domain MEDICINE (ambiguity) while the domain MEDICINE is associated with both the terms AIDS and HIV (variability).
More formally, let $\mathcal{D} = \{D_1, D_2, \ldots, D_{k'}\}$ be a set of domains, such that $k' \ll k$. A Domain Model is fully defined by a $k \times k'$ domain matrix $\mathbf{D}$ representing in each cell $d_{i,z}$ the domain relevance of term $w_i$ with respect to the domain $D_z$. The domain matrix $\mathbf{D}$ is used to define a function $\mathcal{D} : \mathbb{R}^k \to \mathbb{R}^{k'}$, that maps the vectors $\vec{t}_j$, expressed in the classical VSM, into the vectors $\vec{t}'_j$ in the domain VSM. $\mathcal{D}$ is defined by²

$$\mathcal{D}(\vec{t}_j) = \vec{t}_j (\mathbf{I}^{IDF} \mathbf{D}) = \vec{t}'_j \qquad (1)$$

where $\mathbf{I}^{IDF}$ is a diagonal matrix such that $i^{IDF}_{i,i} = IDF(w_i)$, $\vec{t}_j$ is represented as a row vector, and $IDF(w_i)$ is the Inverse Document Frequency of $w_i$.
Vectors in the domain VSM are called Domain Vectors. Domain Vectors for texts are estimated by exploiting formula 1, while the Domain Vector $\vec{w}'_i$, corresponding to the word $w_i \in V$, is the $i$-th row of the domain matrix $\mathbf{D}$. To be a valid domain matrix such vectors should be normalized (i.e. $\langle \vec{w}'_i, \vec{w}'_i \rangle = 1$).
In the Domain VSM the similarity among Domain Vectors is estimated by taking into account second-order relations among terms. For example the similarity of the two sentences “He is affected by AIDS” and “HIV is a virus” is very high, because the terms AIDS, HIV and virus are highly associated with the domain MEDICINE.
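As an illustration of formula 1, the following Python sketch maps three example sentences into the domain VSM using the matrix of Table 1 as printed (its rows are not rescaled to unit length here). The four-word vocabulary, the uniform `IDF` weights, and the helper names `to_vsm` and `to_domain_vsm` are assumptions made only for this example.

```python
VOCAB = ["hiv", "aids", "virus", "laptop"]   # terms w_1..w_k, k = 4
DOMAINS = ["MEDICINE", "COMPUTER SCIENCE"]   # domains D_1..D_k', k' = 2
# Domain matrix D copied from Table 1 (one row per term, one column per domain).
D = [[1.0, 0.0],
     [1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
IDF = [1.0, 1.0, 1.0, 1.0]  # assumption: uniform IDF, so I^IDF is the identity matrix

def to_vsm(text):
    """Row vector of term frequencies over VOCAB; out-of-vocabulary words are ignored."""
    tokens = text.lower().split()
    return [float(tokens.count(w)) for w in VOCAB]

def to_domain_vsm(t):
    """Formula 1: t' = t (I^IDF D)."""
    return [sum(t[i] * IDF[i] * D[i][z] for i in range(len(VOCAB)))
            for z in range(len(DOMAINS))]

for sentence in ("he is affected by AIDS",
                 "HIV is a virus",
                 "the laptop has been infected by a virus"):
    print(sentence, "->", to_domain_vsm(to_vsm(sentence)))
# he is affected by AIDS -> [1.0, 0.0]
# HIV is a virus -> [1.5, 0.5]
# the laptop has been infected by a virus -> [0.5, 1.5]
```

The first two sentences share no term, yet both receive most of their weight on the MEDICINE coordinate, which is what makes them comparable in the domain VSM.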
² In (Wong et al., 1985) a similar schema is adopted to define a Generalized Vector Space Model, of which the Domain VSM is a particular instance.
In this work we propose the use of Latent Semantic Analysis (LSA) (Deerwester et al., 1990) to induce Domain Models from corpora. LSA is an unsupervised technique for estimating the similarity among texts and terms in a corpus. LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix $\mathbf{T}$ describing the corpus. The SVD algorithm can be exploited to acquire a domain matrix $\mathbf{D}$ from a large corpus $\mathcal{T}$ in a totally unsupervised way. SVD decomposes the term-by-document matrix $\mathbf{T}$ into three matrices, $\mathbf{T} \simeq \mathbf{V} \mathbf{\Sigma}_{k'} \mathbf{U}^T$, where $\mathbf{\Sigma}_{k'}$ is the diagonal $k \times k$ matrix containing the highest $k' \ll k$ singular values of $\mathbf{T}$, and all the remaining elements set to 0. The parameter $k'$ is the dimensionality of the Domain VSM and can be fixed in advance³. Under this setting we define the domain matrix $\mathbf{D}_{LSA}$⁴ as

$$\mathbf{D}_{LSA} = \mathbf{I}^{N} \mathbf{V} \sqrt{\mathbf{\Sigma}_{k'}} \qquad (2)$$

where $\mathbf{I}^{N}$ is a diagonal matrix such that $i^{N}_{i,i} = \frac{1}{\sqrt{\langle \vec{w}'_i, \vec{w}'_i \rangle}}$, and $\vec{w}'_i$ is the $i$-th row of the matrix $\mathbf{V} \sqrt{\mathbf{\Sigma}_{k'}}$.
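The construction of Equation 2 can be sketched on a toy corpus in plain Python. This is not the procedure used in the experiments (which relied on a dedicated sparse SVD package): here the leading singular vectors are approximated by power iteration with deflation on $\mathbf{T}\mathbf{T}^T$, a choice that is only adequate for tiny dense matrices. The toy matrix `T`, the function names, and the iteration count are all assumptions made for illustration.

```python
import math

def matvec(A, x):
    """Product of a square matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_eigenpairs(A, k_prime, iters=500):
    """Leading eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    A = [row[:] for row in A]  # work on a copy
    n = len(A)
    pairs = []
    for _ in range(k_prime):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = matvec(A, v)
            norm = math.sqrt(sum(c * c for c in w))
            v = [c / norm for c in w]
        lam = sum(a * b for a, b in zip(v, matvec(A, v)))
        pairs.append((lam, v))
        for i in range(n):  # deflation: subtract the component just found
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_domain_matrix(T, k_prime):
    """D_LSA = I^N V sqrt(Sigma_k') for a k x n term-by-document matrix T."""
    k = len(T)
    # The left singular vectors V of T are the eigenvectors of T T^T, and the
    # singular values of T are the square roots of its eigenvalues.
    gram = [[float(sum(a * b for a, b in zip(T[i], T[j]))) for j in range(k)]
            for i in range(k)]
    pairs = top_eigenpairs(gram, k_prime)
    rows = [[v[i] * math.sqrt(math.sqrt(lam)) for lam, v in pairs] for i in range(k)]
    # I^N: rescale each term vector (row) to unit length.
    return [[c / math.sqrt(sum(x * x for x in r)) for c in r] for r in rows]

TERMS = ["hiv", "aids", "virus", "laptop"]
T = [[1, 1, 0, 0],   # hiv    (rows are terms, columns are four toy documents)
     [1, 1, 0, 0],   # aids
     [1, 0, 1, 0],   # virus
     [0, 0, 1, 1]]   # laptop
D_lsa = lsa_domain_matrix(T, k_prime=2)

def term_similarity(a, b):
    """Inner product of two unit-length term Domain Vectors."""
    u, v = D_lsa[TERMS.index(a)], D_lsa[TERMS.index(b)]
    return sum(x * y for x, y in zip(u, v))

print(term_similarity("hiv", "aids") > term_similarity("hiv", "laptop"))  # True
```

Terms that occur in the same documents end up with nearly identical Domain Vectors, while terms from unrelated documents are pushed apart, without any labeled data being involved.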
3 The Domain Kernel
Kernel Methods are the state-of-the-art supervised
framework for learning, and they have been success-
fully adopted to approach the TC task (Joachims,
1999a).
The basic idea behind kernel methods is to embed the data into a suitable feature space $\mathcal{F}$ via a mapping function $\phi : X \to \mathcal{F}$, and then use a linear algorithm for discovering nonlinear patterns. Kernel methods allow us to build a modular system, as the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function becomes the only domain-specific module of the system, while the learning algorithm is a general-purpose component. Potentially a kernel function can work with any kernel-based algorithm, such as SVM.
During the learning phase SVMs assign a weight $\lambda_i \geq 0$ to any example $x_i \in X$. All the labeled instances $x_i$ such that $\lambda_i > 0$ are called support vectors. The support vectors lie close to the best separating hyper-plane between positive and negative examples. New examples are then assigned to the class of their closest support vectors, according to equation 3.
³ It is not clear how to choose the right dimensionality. In our experiments we used 400 dimensions.
⁴ When $\mathbf{D}_{LSA}$ is substituted in Equation 1 the Domain VSM is equivalent to a Latent Semantic Space (Deerwester et al., 1990). The only difference in our formulation is that the vectors representing the terms in the Domain VSM are normalized by the matrix $\mathbf{I}^{N}$, and then rescaled, according to their IDF value, by matrix $\mathbf{I}^{IDF}$. Note the analogy with the tf-idf term weighting schema (Salton and McGill, 1983), widely adopted in Information Retrieval.
$$f(x) = \sum_{i=1}^{n} \lambda_i K(x_i, x) + \lambda_0 \qquad (3)$$
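Equation 3 can be read as a weighted vote of the support vectors. The sketch below is a toy illustration, not the SVM implementation used in the experiments: the support vectors, weights, and bias are hypothetical values, and the weights are assumed to carry the sign of each support vector's label (as in the usual SVM dual formulation), which Equation 3 leaves implicit.

```python
def decision_function(x, support_vectors, weights, bias, kernel):
    """Equation 3: f(x) = sum_i lambda_i * K(x_i, x) + lambda_0."""
    return sum(w * kernel(x_i, x) for x_i, w in zip(support_vectors, weights)) + bias

def linear_kernel(u, v):
    """Plain inner product; any valid kernel function could be plugged in instead."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical trained model with one positive and one negative support vector.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, -1.0]   # assumption: signed weights (lambda_i times the label y_i)
bias = 0.0              # lambda_0

def classify(x):
    """Assign the positive class when f(x) is non-negative, the negative class otherwise."""
    score = decision_function(x, support_vectors, weights, bias, linear_kernel)
    return 1 if score >= 0 else -1

print(classify([0.9, 0.1]))  # 1
print(classify([0.2, 0.7]))  # -1
```

Only the `kernel` argument depends on the representation of the data, which is the modularity that the Domain Kernel exploits.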
The kernel function $K$ returns the similarity between two instances in the input space $X$, and can be designed in order to capture the relevant aspects to estimate similarity, just by taking care of satisfying a set of formal requirements, as described in (Schölkopf and Smola, 2001).
In this paper we define the Domain Kernel and we apply it to TC tasks. The Domain Kernel, denoted by $K_D$, can be exploited to estimate the topic similarity between two texts while taking into account the external knowledge provided by a Domain Model (see section 2). It is a variation of the Latent Semantic Kernel (Shawe-Taylor and Cristianini, 2004), in which a Domain Model is exploited to define an explicit mapping $\mathcal{D} : \mathbb{R}^k \to \mathbb{R}^{k'}$ from the classical VSM into the domain VSM. The Domain Kernel is defined by

$$K_D(t_i, t_j) = \frac{\langle \mathcal{D}(t_i), \mathcal{D}(t_j) \rangle}{\sqrt{\langle \mathcal{D}(t_j), \mathcal{D}(t_j) \rangle \langle \mathcal{D}(t_i), \mathcal{D}(t_i) \rangle}} \qquad (4)$$

where $\mathcal{D}$ is the Domain Mapping defined in equation 1. To be fully defined, the Domain Kernel requires a Domain Matrix $\mathbf{D}$. In principle, $\mathbf{D}$ can be acquired from any corpus by exploiting any (soft) term clustering algorithm. However, we believe that adequate Domain Models for particular tasks can be better acquired from collections of documents from the same source. For this reason, for the experiments reported in this paper, we acquired the matrix $\mathbf{D}_{LSA}$, defined by equation 2, using the whole (unlabeled) training corpora available for each task, thereby tuning the Domain Model on the particular task in which it will be applied.
A more traditional approach to measure topic similarity among texts consists of extracting BoW features and comparing them in a vector space. The BoW kernel, denoted by $K_{BoW}$, is a particular case of the Domain Kernel, in which $\mathbf{D} = \mathbf{I}$, and $\mathbf{I}$ is the identity matrix. The BoW Kernel does not require a Domain Model, so we can consider this setting as “purely supervised”, in which no external knowledge source is provided.
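The relation between the two kernels can be made concrete with a short Python sketch of Equation 4, in which the BoW Kernel is obtained by passing the identity matrix as domain matrix. The vocabulary order, the toy domain matrix (taken from Table 1), the uniform IDF weights, and the function names are assumptions made for this example only.

```python
import math

def domain_map(t, D, idf):
    """Equation 1: map a row vector t from the classical VSM into the domain VSM."""
    return [sum(t[i] * idf[i] * D[i][z] for i in range(len(t)))
            for z in range(len(D[0]))]

def domain_kernel(t_i, t_j, D, idf):
    """Equation 4: normalized inner product (cosine) of the two mapped vectors."""
    a, b = domain_map(t_i, D, idf), domain_map(t_j, D, idf)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def bow_kernel(t_i, t_j, idf):
    """K_BoW: the Domain Kernel with D = I."""
    k = len(t_i)
    identity = [[1.0 if r == c else 0.0 for c in range(k)] for r in range(k)]
    return domain_kernel(t_i, t_j, identity, idf)

# Vocabulary order: hiv, aids, virus, laptop; domains: MEDICINE, COMPUTER SCIENCE.
D = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
idf = [1.0, 1.0, 1.0, 1.0]        # assumption: uniform IDF weights
aids_text = [0.0, 1.0, 0.0, 0.0]  # "he is affected by AIDS"
hiv_text = [1.0, 0.0, 1.0, 0.0]   # "HIV is a virus"

print(bow_kernel(aids_text, hiv_text, idf))                  # 0.0
print(round(domain_kernel(aids_text, hiv_text, D, idf), 3))  # 0.949
```

Because both kernels share the same code path, switching between the supervised and the semi-supervised setting amounts to changing the matrix passed to the kernel.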
4 Evaluation
We compared the performance of both $K_D$ and $K_{BoW}$ on two standard TC benchmarks. In subsection 4.1 we describe the evaluation tasks and the preprocessing steps, and in subsection 4.2 we describe some algorithmic details of the TC system adopted. Finally, in subsection 4.3 we compare the learning curves of $K_D$ and $K_{BoW}$.
4.1 Text Categorization tasks
For the experiments reported in this paper, we se-
lected two evaluation benchmarks typically used in
the TC literature (Sebastiani, 2002): the 20news-
groups and the Reuters corpora. In both the data sets
we tagged the texts for part of speech and we consid-
ered only the noun, verb, adjective, and adverb parts
of speech, representing them by vectors containing
the frequencies of each disambiguated lemma. The
only feature selection step we performed was to re-
move all the closed-class words from the document
index.
20newsgroups. The 20Newsgroups data set⁵ is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). We removed cross-posts (duplicates), newsgroup-identifying headers (i.e. Xref, Newsgroups, Path, Followup-To, Date), and empty documents from the original corpus, so as to obtain 18,941 documents. Then we randomly divided it into training (80%) and test (20%) sets, containing respectively 15,153 and 3,788 documents.
⁵ Available at http://www.ai.mit.edu/people/jrennie/20Newsgroups/.
⁶ Available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
Reuters. We used the Reuters-21578 collection⁶, and we split it into training and test partitions according to the standard ModApte split. It includes 12,902 documents for 90 categories, with a fixed splitting between training and test data. We conducted our experiments by considering only the 10 most frequent categories, i.e. Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn, and we included in our dataset all the non-empty documents labeled with at least one of those categories. Thus the final dataset includes 9,295 documents, of which 6,680 are included in the training partition, and 2,615 are in the test set.
4.2 Implementation details
As a supervised learning device, we used the SVM implementation described in (Joachims, 1999a). The Domain Kernel is implemented by defining an explicit feature mapping according to formula 1, and by normalizing each vector to obtain vectors of unit length. All the experiments have been performed with the standard parameter settings, using a linear kernel.
We acquired a different Domain Model for each corpus by performing the SVD process on the term-by-document matrices representing the whole training partitions, and we considered only the first 400 domains (i.e. $k' = 400$)⁷.
As far as the Reuters task is concerned, the TC problem has been approached as a set of binary filtering problems, allowing the TC system to provide more than one category label for each document. For the 20newsgroups task, we implemented a one-versus-all classification schema, in order to assign a single category to each document.
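The two decision schemas differ only in how the outputs of the per-category binary classifiers are combined. The sketch below assumes that each classifier returns a real-valued score $f(x)$ and that a category is accepted when its score is positive; the zero threshold, the function names, and the example scores are hypothetical.

```python
def multi_label(scores, threshold=0.0):
    """Binary filtering: keep every category whose classifier score exceeds the threshold."""
    return sorted(c for c, s in scores.items() if s > threshold)

def single_label(scores):
    """One-versus-all: keep only the category with the highest classifier score."""
    return max(scores, key=scores.get)

# Hypothetical classifier scores for one document per task.
reuters_scores = {"Earn": -1.2, "Grain": 0.8, "Wheat": 0.3, "Crude": -0.4}
news_scores = {"misc.forsale": 0.1,
               "comp.sys.mac.hardware": -0.7,
               "soc.religion.christian": -0.2}

print(multi_label(reuters_scores))  # ['Grain', 'Wheat']
print(single_label(news_scores))    # misc.forsale
```

With binary filtering a document may receive several labels or none, whereas one-versus-all always returns exactly one label.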
4.3 Domain Kernel versus BoW Kernel
Figure 1 and Figure 2 report the learning curves for both $K_D$ and $K_{BoW}$, evaluated respectively on the Reuters and the 20newsgroups task. Results clearly show that $K_D$ always outperforms $K_{BoW}$, especially when a very limited amount of labeled data is provided for learning.
⁷ To perform the SVD operation we adopted SVDLIBC, an optimized package for sparse matrices that makes it possible to perform this step in a few minutes even for large corpora. It can be downloaded from http://tedlab.mit.edu/~dr/SVDLIBC/.
[Figure 1: Micro-F1 learning curves for Reuters. x-axis: fraction of labeled training data; y-axis: F1 measure; curves: Domain Kernel, BoW Kernel.]
[Figure 2: Micro-F1 learning curves for 20newsgroups. x-axis: fraction of labeled training data; y-axis: F1 measure; curves: Domain Kernel, BoW Kernel.]
Table 2 compares the performances of the two kernels at full learning. $K_D$ achieves a better micro-F1 than $K_{BoW}$ in both tasks. The improvement is particularly significant in the Reuters task (+2.8 points).
Table 3 shows the number of labeled examples required by $K_D$ and $K_{BoW}$ to achieve the same micro-F1 in the Reuters task. $K_D$ requires only 146 examples to obtain a micro-F1 of 0.84, while $K_{BoW}$ requires 1380 examples to achieve the same performance. In the same task, $K_D$ surpasses the performance of $K_{BoW}$ at full learning using only 10% of the labeled data. The last column of the table shows clearly that $K_D$ requires roughly 90% less labeled data than $K_{BoW}$ to achieve the same performance.
A similar behavior is reported in Table 4 for the 20newsgroups task. It is important to notice that the number of labeled documents is higher in this corpus than in the previous one. The benefits of using Domain Models are then less evident at full learning, even if they are significant when very few labeled data are provided.

F1             Domain Kernel   BoW Kernel
Reuters        0.928           0.900
20newsgroups   0.886           0.880
Table 2: Micro-F1 with full learning

F1     Domain Kernel   BoW Kernel   Ratio
.54    14              267          5%
.84    146             1380         10%
.90    668             6680         10%
Table 3: Number of training examples needed by $K_D$ and $K_{BoW}$ to reach the same micro-F1 on the Reuters task

Figures 3 and 4 report a more detailed analysis by comparing the micro-precision and micro-recall learning curves of both kernels in the Reuters task⁸. It is clear from the graphs that the main contribution of $K_D$ is an increase in recall, while precision is similar in both cases⁹. This last result confirms our hypothesis that the information provided by the Domain Models allows the system to generalize in a more effective way over the training examples, making it possible to estimate the similarity among texts even if they have just a few words in common.
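The relation between the reported measures can be checked with a few lines of Python. The sketch assumes the usual definitions of micro-averaged precision, recall, and F1 computed from pooled per-category counts; the per-category counts below are hypothetical, while the precision and recall values 0.93 and 0.38 are those reported in footnote 9.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_prf(counts):
    """Micro-averaged precision, recall and F1 from per-category (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1(p, r)

# Precision/recall reported for K_D with 14 training examples on Reuters.
print(round(f1(0.93, 0.38), 2))  # 0.54

# Hypothetical single-label setting: every misclassified document is a false positive
# for one category and a false negative for another, so the pooled totals coincide.
single_label_counts = [(5, 1, 2), (3, 2, 0), (2, 0, 1)]
p, r, f = micro_prf(single_label_counts)
print(p == r)  # True
```

The second check illustrates why, when exactly one label is assigned to every document, micro-precision, micro-recall, and micro-F1 take the same value.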
Finally, $K_D$ achieves state-of-the-art results in the Reuters task, as reported in section 5.
⁸ For the 20newsgroups task both micro-precision and micro-recall are equal to micro-F1 because a single category label has been assigned to every instance.
⁹ It is worth noting that $K_D$ gets an F1 measure of 0.54 (Precision/Recall of 0.93/0.38) using just 14 training examples, suggesting that it can be profitably exploited for a bootstrapping process.

F1     Domain Kernel   BoW Kernel   Ratio
.50    30              500          6%
.70    98              1182         8%
.85    2272            7879         29%
Table 4: Number of training examples needed by $K_D$ and $K_{BoW}$ to reach the same micro-F1 on the 20newsgroups task

[Figure 3: Learning curves for Reuters (Precision). x-axis: fraction of labeled training data; y-axis: precision; curves: Domain Kernel, BoW Kernel.]
[Figure 4: Learning curves for Reuters (Recall). x-axis: fraction of labeled training data; y-axis: recall; curves: Domain Kernel, BoW Kernel.]

5 Related Works
To our knowledge, the first attempt to apply the semi-supervised learning schema to TC was reported in (Blum and Mitchell, 1998). Their co-training algorithm was able to significantly reduce the error rate, compared to a strictly supervised classifier.
(Nigam et al., 2000) adopted an Expectation Maximization (EM) schema to deal with the same problem, extensively evaluating their approach on several datasets. They compared their algorithm with a standard probabilistic approach to TC, reporting substantial improvements in the learning curve.
A similar evaluation is also reported in (Joachims, 1999b), where a transductive SVM is compared to a state-of-the-art TC classifier based on SVM. The semi-supervised approach obtained better results than the standard one with few learning data, while at full learning the results seem to converge.
(Bekkerman et al., 2002) adopted an SVM classifier in which texts have been represented by their associations to a set of Distributional Word Clusters. Even if this approach is very similar to ours, it is not a semi-supervised learning schema, because the authors did not exploit any additional unlabeled data to induce word clusters.
In (Zelikovitz and Hirsh, 2001) background knowledge (i.e. the unlabeled data) is exploited together with labeled data to estimate document similarity in a Latent Semantic Space (Deerwester et al., 1990). Their approach differs from the one proposed in this paper because a different categorization algorithm has been adopted. The authors compared their algorithm with an EM schema (Nigam et al., 2000) on the same dataset, reporting better results only with very few labeled data, while EM performs better with more training data.
All the semi-supervised approaches in the literature report better results than strictly supervised ones with few labeled examples, while with more data the learning curves tend to converge.
A comparative evaluation among semi-supervised TC algorithms is quite difficult, because the data sets used, the preprocessing steps and the splitting partitions adopted considerably affect the final results.
Nevertheless, we reported the best F1 measure on the Reuters corpus: to our knowledge, the state of the art on the 10 most frequent categories of the ModApte split at full learning is an F1 of 92.0 (Bekkerman et al., 2002), while we obtained 92.8. It is important to notice here that this result has been obtained thanks to the improvements of the Domain Kernel. In addition, on the 20newsgroups task, our method requires about 100 documents (i.e. five documents per category) to achieve 70% F1, while both EM (Nigam et al., 2000) and LSI (Zelikovitz and Hirsh, 2001) require more than 400 to achieve the same performance.
6 Conclusion and Future Works
In this paper a novel technique to perform semi-supervised learning for TC has been proposed and evaluated. We defined a Domain Kernel that allows us to improve the similarity estimation among documents by exploiting Domain Models. Domain Models are acquired from large collections of non-annotated texts in a totally unsupervised way.
An extensive evaluation on two standard benchmarks shows that the Domain Kernel allows us to drastically reduce the amount of training data required for learning. In particular, recall increases considerably, while a very good accuracy is preserved. We explained this phenomenon by showing that the similarity scores evaluated by the Domain Kernel take into account both variability and ambiguity, being able to estimate similarity even among texts that do not have any word in common.
As future work, we plan to apply our semi-supervised learning method to some concrete application scenarios, such as user modeling and categorization of personal documents in mail clients. In addition, we are going deeper in the direction of semi-supervised learning, by acquiring more complex structures than clusters (e.g. synonymy, hyperonymy) to represent domain models. Furthermore, we are working to adapt the general framework provided by the Domain Models to a multilingual scenario, in order to apply the Domain Kernel to a Cross Language TC task.
Acknowledgments
This work has been partially supported by the ON-
TOTEXT (From Text to Knowledge for the Se-
mantic Web) project, funded by the Autonomous
Province of Trento under the FUP-2004 research
program.

References
R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. 2002. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 1:1183–1208.
A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers.
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science.
A. Gliozzo, C. Strapparava, and I. Dagan. 2004. Unsupervised and supervised exploitation of semantic domains in lexical disambiguation. Computer Speech and Language, 18:275–299.
T. Joachims. 1999a. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in kernel methods: support vector learning, chapter 11, pages 169–184. MIT Press, Cambridge, MA, USA.
T. Joachims. 1999b. Transductive inference for text classification using support vector machines. In Proceedings of ICML-99, 16th International Conference on Machine Learning, pages 200–209. Morgan Kaufmann Publishers, San Francisco, US.
T. Joachims. 2002. Learning to Classify Text using Support Vector Machines. Kluwer Academic Publishers.
B. Magnini, C. Strapparava, G. Pezzulo, and A. Gliozzo. 2002. The role of domain information in word sense disambiguation. Natural Language Engineering, 8(4):359–373.
K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134.
G. Salton and M.H. McGill. 1983. Introduction to modern information retrieval. McGraw-Hill, New York.
B. Schölkopf and A. J. Smola. 2001. Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
J. Shawe-Taylor and N. Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
C. Strapparava, A. Gliozzo, and C. Giuliano. 2004. Pattern abstraction and term similarity for word sense disambiguation: Irst at senseval-3. In Proc. of SENSEVAL-3 Third International Workshop on Evaluation of Systems for the Semantic Analysis of Text, pages 229–234, Barcelona, Spain, July.
S.K.M. Wong, W. Ziarko, and P.C.N. Wong. 1985. Generalized vector space model in information retrieval. In Proceedings of the 8th ACM SIGIR Conference.
S. Zelikovitz and H. Hirsh. 2001. Using LSI for text classification in the presence of background text. In Henrique Paques, Ling Liu, and David Grossman, editors, Proceedings of CIKM-01, 10th ACM International Conference on Information and Knowledge Management, pages 113–118, Atlanta, US. ACM Press, New York, US.
