<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0608"> <Title>Domain Kernels for Text Categorization</Title> <Section position="3" start_page="0" end_page="56" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Text Categorization (TC) deals with the problem of assigning a set of category labels to documents. Categories are usually defined according to a variety of topics (e.g. SPORT vs. POLITICS), and a set of hand-tagged examples is provided for training. In state-of-the-art TC settings, supervised classifiers are used for learning, and texts are represented by means of bags of words.</Paragraph>
<Paragraph position="1"> Although supervised approaches reach, in principle, the best performance in many Natural Language Processing (NLP) tasks, in practice it is not always easy to apply them to concrete applicative settings.</Paragraph>
<Paragraph position="2"> In fact, supervised systems for TC require a large amount of hand-tagged texts for training. This is usually feasible only when someone (e.g. a large company) can easily provide already classified documents with which to train the system.</Paragraph>
<Paragraph position="3"> In most cases this scenario is quite impractical, if not infeasible. An example is the task of categorizing personal documents, in which the categories can be modified according to the user's interests: new categories are often introduced and, possibly, the labeled training data available for them is very limited.</Paragraph>
<Paragraph position="4"> In the NLP literature, the problem of providing large amounts of manually annotated data is known as the Knowledge Acquisition Bottleneck. Current research on supervised approaches to NLP often deals with defining methodologies and algorithms that reduce the amount of human effort required to collect labeled examples.</Paragraph>
<Paragraph position="5"> A promising direction for solving this problem is to provide unlabeled data together with labeled texts to help supervision. In the Machine Learning literature this learning schema has been called semi-supervised learning. It has been applied to the TC problem using different techniques: co-training (Blum and Mitchell, 1998), the EM algorithm (Nigam et al., 2000), Transductive SVMs (Joachims, 1999b) and Latent Semantic Indexing (Zelikovitz and Hirsh, 2001).</Paragraph>
<Paragraph position="6"> In this paper we propose a novel technique to perform semi-supervised learning for TC. The underlying idea behind our approach is that lexical coherence (i.e. the co-occurrence in texts of semantically related terms) (Magnini et al., 2002) is an inherent property of corpora, and it can be exploited to help a supervised classifier build a better categorization hypothesis, even if the amount of labeled training data provided for learning is very low.</Paragraph>
<Paragraph position="7"> Our proposal consists of defining a Domain Kernel and exploiting it inside a Support Vector Machine (SVM) classification framework for TC (Joachims, 2002). The Domain Kernel relies on the notion of a Domain Model, which is a shallow representation of lexical ambiguity and variability. Domain Models can be acquired in an unsupervised way from unlabeled data, and then exploited to define a Domain Kernel (i.e. a generalized similarity function among documents)1.</Paragraph>
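As a concrete illustration of this proposal, the following is a minimal Python sketch assuming scikit-learn is available: an LSA projection learned from unlabeled text stands in for the Domain Model, and the inner product in the projected space plays the role of the Domain Kernel inside an SVM with a precomputed Gram matrix. The toy corpora and the helper names (domain_features, domain_kernel) are illustrative assumptions, not the paper's actual implementation.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.svm import SVC

    # Toy unlabeled corpus standing in for the much larger corpus from
    # which a Domain Model would actually be acquired.
    unlabeled = ["the team won the match", "the striker scored a goal",
                 "parliament passed the bill", "the minister gave a speech"]
    train_texts = ["a great goal decided the match", "the bill divided parliament"]
    train_labels = [0, 1]  # 0 = SPORT, 1 = POLITICS

    # Unsupervised step: approximate a Domain Model with an LSA projection.
    vec = TfidfVectorizer().fit(unlabeled)
    lsa = TruncatedSVD(n_components=2, random_state=0).fit(vec.transform(unlabeled))

    def domain_features(texts):
        # Map documents from term space into the lower-dimensional domain space.
        return lsa.transform(vec.transform(texts))

    def domain_kernel(A, B):
        # Generalized similarity among documents: inner product in domain space.
        return np.dot(domain_features(A), domain_features(B).T)

    # Supervised step: plug the kernel into an SVM via a precomputed Gram matrix.
    clf = SVC(kernel="precomputed").fit(domain_kernel(train_texts, train_texts),
                                        train_labels)
    print(clf.predict(domain_kernel(["the goalkeeper saved the match"], train_texts)))

Because the projection is learned from unlabeled data alone, the labeled sample can stay small: semantically related terms (e.g. "goal" and "goalkeeper") land near each other in domain space even if they never co-occur in the training examples.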
<Paragraph position="8"> We evaluated the Domain Kernel on two standard benchmarks for TC (i.e. Reuters and 20Newsgroups), and we compared its performance with a kernel function that exploits a more standard bag-of-words (BoW) feature representation. The Domain Kernel yielded a significant improvement in the learning curves of both tasks. In particular, there is a notable increase in recall, especially with few learning examples. In addition, the F1 measure increases by 2.8 points on the Reuters task at full learning, achieving state-of-the-art results.</Paragraph>
<Paragraph position="9"> The paper is structured as follows. Section 2 introduces the notion of Domain Model and describes an automatic acquisition technique based on Latent Semantic Analysis (LSA). In Section 3 we illustrate the SVM approach to TC and define a Domain Kernel that exploits Domain Models to estimate similarity among documents. In Section 4 the performance of the Domain Kernel is compared with that of a standard bag-of-words feature representation, showing the improvements in the learning curves. Section 5 describes previous attempts to exploit semi-supervised learning for TC, while Section 6 concludes the paper and proposes some directions for future research.</Paragraph>
<Paragraph position="10"> 1The idea of exploiting a Domain Kernel to help a supervised classification framework has also been profitably used in other NLP tasks, such as word sense disambiguation (see for example (Strapparava et al., 2004)).</Paragraph>
</Section> </Paper>