<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0608">
  <Title>Domain Kernels for Text Categorization</Title>
  <Section position="4" start_page="56" end_page="56" type="metho">
    <SectionTitle>
2 Domain Models
</SectionTitle>
    <Paragraph position="0"> The simplest methodology to estimate the similarity among the topics of two texts is to represent them by means of vectors in the Vector Space Model (VSM), and to exploit the cosine similarity. More formally, let T = ft1,t2,... ,tng be a corpus, let V = fw1,w2,... ,wkg be its vocabulary, let T be the k n term-by-document matrix representing T , such that ti,j is the frequency of word wi into the text tj. The VSM is a k-dimensional space Rk, in which the text tj 2 T is represented by means of the vector vectortj such that the ith component of vectortj is ti,j. The similarity among two texts in the VSM is estimated by computing the cosine.</Paragraph>
    <Paragraph position="1"> However this approach does not deal well with lexical variability and ambiguity. For example the two sentences he is affected by AIDS and HIV is a virus do not have any words in common. In the VSM their similarity is zero because they have orthogonal vectors, even if the concepts they express are very closely related. On the other hand, the similarity between the two sentences the laptop has been infected by a virus and HIV is a virus would turn out very high, due to the ambiguity of the word virus.</Paragraph>
    <Paragraph position="2"> To overcome this problem we introduce the notion of Domain Model (DM), and we show how to use it in order to de ne a domain VSM, in which texts and terms are represented in a uniform way.</Paragraph>
    <Paragraph position="3"> A Domain Model is composed by soft clusters of terms. Each cluster represents a semantic domain (Gliozzo et al., 2004), i.e. a set of terms that often co-occur in texts having similar topics. A Domain Model is represented by a k kprime rectangular matrix D, containing the degree of association among terms and domains, as illustrated in Table 1.</Paragraph>
  </Section>
  <Section position="5" start_page="56" end_page="57" type="metho">
    <SectionTitle>
[Table 1 header residue: the column labels MEDICINE and COMPUTER SCIENCE from the example domain matrix; the table body was lost in extraction]
</SectionTitle>
    <Paragraph position="0"> Domain Models can be used to describe lexical ambiguity and variability. Lexical ambiguity is rep- null resented by associating one term to more than one domain, while variability is represented by associating different terms to the same domain. For example the term virus is associated to both the domain COMPUTER SCIENCE and the domain MEDICINE (ambiguity) while the domain MEDICINE is associated to both the terms AIDS and HIV (variability). More formally, let D = fD1,D2,...,Dkprimeg be a set of domains, such that kprime k. A Domain Model is fully de ned by a k kprime domain matrix D representing in each cell di,z the domain relevance of term wi with respect to the domain Dz.</Paragraph>
    <Paragraph position="1"> The domain matrix D is used to de ne a function D : Rk ! Rkprime, that maps the vectors vectortj, expressed into the classical VSM, into the vectors vectortprimej in the domain VSM. D is de ned by2</Paragraph>
    <Paragraph position="3"> where IIDF is a diagonal matrix such that iIDFi,i = IDF(wi), vectortj is represented as a row vector, and IDF(wi) is the Inverse Document Frequency of wi.</Paragraph>
    <Paragraph position="4"> Vectors in the domain VSM are called Domain Vectors. Domain Vectors for texts are estimated by exploiting formula 1, while the Domain Vector vectorwprimei, corresponding to the word wi 2 V , is the ith row of the domain matrix D. To be a valid domain matrix such vectors should be normalized (i.e. h vectorwprimei, vectorwprimeii = 1).</Paragraph>
    <Paragraph position="5"> In the Domain VSM the similarity among Domain Vectors is estimated by taking into account second order relations among terms. For example the similarity of the two sentences He is affected by AIDS and HIV is a virus is very high, because the terms AIDS, HIV and virus are highly associated to the domain MEDICINE.</Paragraph>
    <Paragraph position="6"> In this work we propose the use of Latent Semantic Analysis (LSA) (Deerwester et al., 1990) to induce Domain Models from corpora. LSA is an unsupervised technique for estimating the similarity among texts and terms in a corpus. LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix T describing the corpus. The SVD algorithm can be exploited to acquire a domain matrix D from a large 2In (Wong et al., 1985) a similar schema is adopted to de ne a Generalized Vector Space Model, of which the Domain VSM is a particular instance.</Paragraph>
    <Paragraph position="7"> corpus T in a totally unsupervised way. SVD decomposes the term-by-document matrix T into three matrixes T ' VSkprimeUT where Skprime is the diagonal k k matrix containing the highest kprime k eigenvalues of T, and all the remaining elements set to 0. The parameter kprime is the dimensionality of the Domain VSM and can be xed in advance3. Under this setting we de ne the domain matrix DLSA4 as</Paragraph>
    <Paragraph position="9"> where IN is a diagonal matrix such that iNi,i =</Paragraph>
  </Section>
  <Section position="6" start_page="57" end_page="58" type="metho">
    <SectionTitle>
3 The Domain Kernel
</SectionTitle>
    <Paragraph position="0"> Kernel Methods are the state-of-the-art supervised framework for learning, and they have been successfully adopted to approach the TC task (Joachims, 1999a).</Paragraph>
    <Paragraph position="1"> The basic idea behind kernel methods is to embed the data into a suitable feature space F via a mapping function ph : X ! F, and then use a linear algorithm for discovering nonlinear patterns. Kernel methods allow us to build a modular system, as the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function becomes the only domain speci c module of the system, while the learning algorithm is a general purpose component. Potentially a kernel function can work with any kernel-based algorithm, such as for example SVM.</Paragraph>
    <Paragraph position="2"> During the learning phase SVMs assign a weight li 0 to any example xi 2 X. All the labeled instances xi such that li &gt; 0 are called support vectors. The support vectors lie close to the best separating hyper-plane between positive and negative examples. New examples are then assigned to the class of its closest support vectors, according to equation  is equivalent to a Latent Semantic Space (Deerwester et al., 1990). The only difference in our formulation is that the vectors representing the terms in the Domain VSM are normalized by the matrix IN, and then rescaled, according to their IDF value, by matrix IIDF. Note the analogy with the tf idf term weighting schema (Salton and McGill, 1983), widely adopted in Information Retrieval.</Paragraph>
    <Paragraph position="4"> The kernel function K returns the similarity between two instances in the input space X, and can be designed in order to capture the relevant aspects to estimate similarity, just by taking care of satisfying set of formal requirements, as described in (Schcurrency1olkopf and Smola, 2001).</Paragraph>
    <Paragraph position="5"> In this paper we de ne the Domain Kernel and we apply it to TC tasks. The Domain Kernel, denoted by KD, can be exploited to estimate the topic similarity among two texts while taking into account the external knowledge provided by a Domain Model (see section 2). It is a variation of the Latent Semantic Kernel (Shawe-Taylor and Cristianini, 2004), in which a Domain Model is exploited to de ne an explicit mapping D : Rk ! Rkprime from the classical VSM into the domain VSM. The Domain Kernel is de ned by</Paragraph>
    <Paragraph position="7"> where D is the Domain Mapping de ned in equation 1. To be fully de ned, the Domain Kernel requires a Domain Matrix D. In principle, D can be acquired from any corpora by exploiting any (soft) term clustering algorithm. Anyway, we belive that adequate Domain Models for particular tasks can be better acquired from collections of documents from the same source. For this reason, for the experiments reported in this paper, we acquired the matrix DLSA, de ned by equation 2, using the whole (unlabeled) training corpora available for each task, so tuning the Domain Model on the particular task in which it will be applied.</Paragraph>
    <Paragraph position="8"> A more traditional approach to measure topic similarity among text consists of extracting BoW features and to compare them in a vector space. The BoW kernel, denoted by KBoW , is a particular case of the Domain Kernel, in which D = I, and I is the identity matrix. The BoW Kernel does not require a Domain Model, so we can consider this setting as purely supervised, in which no external knowledge source is provided.</Paragraph>
  </Section>
  <Section position="7" start_page="58" end_page="60" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We compared the performance of both KD and KBoW on two standard TC benchmarks. In sub-section 4.1 we describe the evaluation tasks and the preprocessing steps, in 4.2 we describe some algorithmic details of the TC system adopted. Finally in subsection 4.3 we compare the learning curves of KD and KBoW .</Paragraph>
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
4.1 Text Categorization tasks
</SectionTitle>
      <Paragraph position="0"> For the experiments reported in this paper, we selected two evaluation benchmarks typically used in the TC literature (Sebastiani, 2002): the 20newsgroups and the Reuters corpora. In both the data sets we tagged the texts for part of speech and we considered only the noun, verb, adjective, and adverb parts of speech, representing them by vectors containing the frequencies of each disambiguated lemma. The only feature selection step we performed was to remove all the closed-class words from the document index.</Paragraph>
      <Paragraph position="1"> 20newsgroups. The 20Newsgroups data set5 is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classi cation and text clustering. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). We removed cross-posts (duplicates), newsgroup-identifying headers (i.e. Xref, Newsgroups, Path, Followup-To, Date), and empty documents from the original corpus, so to obtain 18,941 documents. Then we randomly divided it into training (80%) and test (20%) sets, containing respectively 15,153 and  partitions according to the standard ModApt e split. It includes 12,902 documents for 90 categories, with a xed splitting between training and test data. We conducted our experiments by considering only the 10 most frequent categories, i.e. Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn, and we included in our dataset all the non empty documents labeled with at least one of those categories. Thus the nal dataset includes 9295 document, of which 6680 are included in the training partition, and 2615 are in the test set.</Paragraph>
    </Section>
    <Section position="2" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.2 Implementation details
</SectionTitle>
      <Paragraph position="0"> As a supervised learning device, we used the SVM implementation described in (Joachims, 1999a).</Paragraph>
      <Paragraph position="1"> The Domain Kernel is implemented by de ning an explicit feature mapping according to formula 1, and by normalizing each vector to obtain vectors of unitary length. All the experiments have been performed on the standard parameter settings, using a linear kernel.</Paragraph>
      <Paragraph position="2"> We acquired a different Domain Model for each corpus by performing the SVD processes on the term-by-document matrices representing the whole training partitions, and we considered only the rst 400 domains (i.e. kprime = 400)7.</Paragraph>
      <Paragraph position="3"> As far as the Reuters task is concerned, the TC problem has been approached as a set of binary ltering problems, allowing the TC system to provide more than one category label to each document. For the 20newsgroups task, we implemented a one-versus-all classi cation schema, in order to assign a single category to each news.</Paragraph>
    </Section>
    <Section position="3" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
4.3 Domain Kernel versus BoW Kernel
</SectionTitle>
      <Paragraph position="0"> Figure 1 and Figure 2 report the learning curves for both KD and KBoW , evaluated respectively on the Reuters and the 20newgroups task. Results clearly show that KD always outperforms KBoW , especially when very limited amount of labeled data is provided for learning.</Paragraph>
      <Paragraph position="1"> 7To perform the SVD operation we adopted LIBSVDC, an optimized package for sparse matrix that allows to perform this step in few minutes even for large corpora. It can be downloaded from  Table 2 compares the performances of the two kernels at full learning. KD achieves a better micro-F1 than KBoW in both tasks. The improvement is particularly signi cant in the Reuters task (+ 2.8 %). Tables 3 shows the number of labeled examples required by KD and KBoW to achieve the same micro-F1 in the Reuters task. KD requires only 146 examples to obtain a micro-F1 of 0.84, while KBoW requires 1380 examples to achieve the same performance. In the same task, KD surpass the performance of KBoW at full learning using only the 10% of the labeled data. The last column of the table shows clearly that KD requires 90% less labeled data than KBoW to achieve the same performances.</Paragraph>
      <Paragraph position="2"> A similar behavior is reported in Table 4 for the</Paragraph>
    </Section>
    <Section position="4" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
Reuters task
</SectionTitle>
      <Paragraph position="0"> 20newsgroups task. It is important to notice that the number of labeled documents is higher in this corpus than in the previous one. The bene ts of using Domain Models are then less evident at full learning, even if they are signi cant when very few labeled data are provided.</Paragraph>
      <Paragraph position="1"> Figures 3 and 4 report a more detailed analysis by comparing the micro-precision and micro-recall learning curves of both kernels in the Reuters task8. It is clear from the graphs that the main contribute of KD is about increasing recall, while precision is similar in both cases9. This last result con rms our hypothesis that the information provided by the Domain Models allows the system to generalize in a more effective way over the training examples, allowing to estimate the similarity among texts even if they have just few words in common.</Paragraph>
      <Paragraph position="2"> Finally, KD achieves the state-of-the-art in the Reuters task, as reported in section 5.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="60" end_page="61" type="metho">
    <SectionTitle>
5 Related Works
</SectionTitle>
    <Paragraph position="0"> To our knowledge, the rst attempt to apply the semi-supervised learning schema to TC has been reported in (Blum and Mitchell, 1998). Their co-training algorithm was able to reduce signi cantly the error rate, if compared to a strictly supervised 8For the 20-newsgroups task both micro-precision and micro-recall are equal to micro-F1 because a single category label has been assigned to every instance.</Paragraph>
    <Paragraph position="1"> 9It is worth noting thatKD gets a F1 measure of 0.54 (Precision/Recall of 0.93/0.38) using just 14 training examples, suggesting that it can be pro tably exploited for a bootstrapping process.</Paragraph>
    <Paragraph position="2">  (Nigam et al., 2000) adopted an Expectation Maximization (EM) schema to deal with the same problem, evaluating extensively their approach on several datasets. They compared their algorithm with a standard probabilistic approach to TC, reporting substantial improvements in the learning curve.</Paragraph>
    <Paragraph position="3">  A similar evaluation is also reported in (Joachims, 1999b), where a transduptive SVM is compared to a state-of-the-art TC classi er based on SVM.</Paragraph>
    <Paragraph position="4"> The semi-supervised approach obtained better results than the standard with few learning data, while at full learning results seem to converge.</Paragraph>
    <Paragraph position="5"> (Bekkerman et al., 2002) adopted a SVM classier in which texts have been represented by their associations to a set of Distributional Word Clusters. Even if this approach is very similar to ours, it is not a semi-supervised learning schema, because authors did not exploit any additional unlabeled data to induce word clusters.</Paragraph>
    <Paragraph position="6"> In (Zelikovitz and Hirsh, 2001) background knowledge (i.e. the unlabeled data) is exploited together with labeled data to estimate document similarity in a Latent Semantic Space (Deerwester et al., 1990). Their approach differs from the one proposed in this paper because a different categorization algorithm has been adopted. Authors compared their algorithm with an EM schema (Nigam et al., 2000) on the same dataset, reporting better results only with very few labeled data, while EM performs better with more training.</Paragraph>
    <Paragraph position="7"> All the semi-supervised approaches in the literature reports better results than strictly supervised ones with few learning, while with more data the learning curves tend to converge.</Paragraph>
    <Paragraph position="8"> A comparative evaluation among semi-supervised TC algorithms is quite dif cult, because the used data sets, the preprocessing steps and the splitting partitions adopted affect sensibly the nal results.</Paragraph>
    <Paragraph position="9"> Anyway, we reported the best F1 measure on the Reuters corpus: to our knowledge, the state-of-the-art on the 10 top most frequent categories of the ModApte split at full learning is F1 92.0 (Bekkerman et al., 2002) while we obtained 92.8. It is important to notice here that this results has been obtained thanks to the improvements of the Domain Kernel. In addition, on the 20newsgroups task, our methods requires about 100 documents (i.e. ve documents per category) to achieve 70% F1, while both EM (Nigam et al., 2000) and LSI (Zelikovitz and Hirsh, 2001) requires more than 400 to achieve the same performance.</Paragraph>
  </Section>
class="xml-element"></Paper>