<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1070"> <Title>Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Random Indexing </SectionTitle> <Paragraph position="0"> One serious problem with BoC approaches is that they tend to be computationally expensive.</Paragraph> <Paragraph position="1"> This is true at least for methods that use factor analytic techniques. Other BoC approaches that use resources such as WordNet have limited portability, and are normally not easily adaptable to other domains and to other languages.</Paragraph> <Paragraph position="2"> To overcome these problems, we have developedanalternativeapproachforproducingBoC null representations. The approach is based on Random Indexing (Kanerva et al., 2000; Karlgren and Sahlgren, 2001), which is a vector space methodology for producing context vectors3 for words based on cooccurrence data. The context vectors can be used to produce BoC representations by combining the context vectors of the words that occur in a text.</Paragraph> <Paragraph position="3"> In the traditional vector space model, context vectors are generated by representing the 3Context vectors represent the distributional profile of words, making it possible to express distributional similarity between words by standard vector similarity measures.</Paragraph> <Paragraph position="4"> data in a cooccurrence matrix F of order wxc, such that the rows Fw represent the words, the columns Fc represent the contexts (typically words or documents4), and the cells are the (weighted and normalized) cooccurrence counts of a given word in a given context. The point of this representation is that each row of cooccurrence counts can be interpreted as a cdimensional context vector vectorw for a given word. In the Random Indexing approach, the cooccurrence matrix is replaced by a context matrix G of order w x k, where k lessmuch c. Each row Gi is the k-dimensional context vector for word i. The context vectors are accumulated by adding together k-dimensional index vectors that have been assigned to each context in the data -- whether document, paragraph, clause, window, or neighboring words. The index vectors constitute a unique representation for each context, and are sparse, high-dimensional, and ternary, which means that their dimensionality k typically is on the order of thousands and that they consist of a small number of randomly distributed +1s and [?]1s. The k-dimensional index vectors are used to accumulate k-dimensional context vectors by the following procedure: every time a given word occurs in a context, that context's index vector is added (by vector addition) to the context vector for the given word. Note that the same procedure will produce a standard cooccurrence matrix F of order wxc if we use unary index vectors of the same dimensionality c as the number of contexts.5 Mathematically, the unary vectors are orthogonal, whereas the random index vectors are only nearly orthogonal. 
<Paragraph position="5"> 4Words are used as contexts in e.g. Hyperspace Analogue to Language (HAL) (Lund et al., 1995), whereas documents are used in e.g. Latent Semantic Indexing/Analysis (LSI/LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997).</Paragraph> <Paragraph position="6"> 5These unary index vectors would have a single 1 marking the place of the context in a list of all contexts -- the nth element of the index vector for the nth context would be 1.</Paragraph> <Paragraph position="7"> The Random Indexing approach is motivated by the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984), which states that if we project points into a randomly selected subspace of sufficiently high dimensionality, the distances between the points are approximately preserved. Thus, if we collect the random index vectors into a random matrix R of order c x k, whose row R_i is the k-dimensional index vector for context i, we find that the following relation holds:</Paragraph> <Paragraph position="8"> $$F_{w \times c}\, R_{c \times k} = G_{w \times k}$$ </Paragraph> <Paragraph position="9"> That is, the Random Indexing context matrix G contains the same information as we get by multiplying the standard cooccurrence matrix F with the random matrix R, where RR^T approximates the identity matrix.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Advantages of Random Indexing </SectionTitle> <Paragraph position="0"> One advantage of using Random Indexing is that it is an incremental method, which means that we do not have to sample all the data before we can start using the context vectors -- Random Indexing can provide intermediary results even after just a few vector additions. Other vector space models need to analyze the entire data before the context vectors are operational.</Paragraph> <Paragraph position="1"> Another advantage is that Random Indexing avoids the &quot;huge matrix step&quot;, since the dimensionality k of the vectors is much smaller than, and not directly dependent on, the number of contexts c in the data. Other vector space models, including those that use dimension reduction techniques such as singular value decomposition, depend on building the w x c cooccurrence matrix F.</Paragraph> <Paragraph position="2"> This &quot;huge matrix step&quot; is perhaps the most serious deficiency of other models, since their complexity becomes dependent on the number of contexts c in the data, which typically is a very large number. Even methods that are mathematically equivalent to Random Indexing, such as random projection (Papadimitriou et al., 1998) and random mapping (Kaski, 1999), are not incremental, and require the initial w x c cooccurrence matrix.</Paragraph> <Paragraph position="3"> Since dimension reduction is built into Random Indexing, we achieve a significant gain in processing time and memory consumption, compared to other models. Furthermore, the approach is scalable, since adding new contexts to the data set does not increase the dimensionality of the context vectors.</Paragraph> </Section>
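As an illustrative sanity check (not an experiment from the paper) of the relation FR = G and of the distance preservation promised by the Johnson-Lindenstrauss Lemma, the sketch below projects a small toy cooccurrence matrix F through a random ternary matrix R; all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w, c, k = 50, 5000, 500          # words, contexts, reduced dimensionality (k << c)

# A sparse toy cooccurrence matrix F of order w x c.
F = (rng.random((w, c)) < 0.01).astype(float)

# Random ternary matrix R of order c x k: each row is an index vector with a
# handful of +1s and -1s, scaled to unit length so that R R^T is close to I.
nonzeros = 6
R = np.zeros((c, k))
for i in range(c):
    pos = rng.choice(k, size=nonzeros, replace=False)
    R[i, pos] = rng.choice([1.0, -1.0], size=nonzeros) / np.sqrt(nonzeros)

G = F @ R                        # the Random Indexing context matrix

# Johnson-Lindenstrauss: pairwise distances are approximately preserved.
for i, j in [(0, 1), (2, 3), (4, 5)]:
    d_full = np.linalg.norm(F[i] - F[j])
    d_reduced = np.linalg.norm(G[i] - G[j])
    print(f"||F{i}-F{j}|| = {d_full:.2f}   ||G{i}-G{j}|| = {d_reduced:.2f}")
```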
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Bag-of-Context vectors </SectionTitle> <Paragraph position="0"> The context vectors produced by Random Indexing can be used to generate BoC representations. This is done by, for every text, summing the (weighted) context vectors of the words that occur in the particular text. Note that summing the vectors results in tf-weighting, since a word's vector is added to the text's vector as many times as the word occurs in the text. The same procedure generates standard BoW representations if we use unary index vectors of the same dimensionality as the number of words in the data instead of context vectors, and weight the summation of the unary index vectors with the idf-values of the words.6</Paragraph> <Paragraph position="1"> 6The same procedure can also be used to produce reduced BoW representations (i.e. BoW representations with reduced dimensionality), which we do by summing the weighted random index vectors of the words that occur in the text. We do not include any results from using reduced BoW representations in this paper, since they contain more noise than the standard BoW vectors. However, they are useful in very high-dimensional applications where efficiency is an important factor.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiment Setup </SectionTitle> <Paragraph position="0"> In the following sections, we describe the setup for our text categorization experiments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle> <Paragraph position="0"> We use the Reuters-21578 test collection, which consists of 21,578 news wire documents that have been manually assigned to different categories. In these experiments, we use the &quot;ModApte&quot; split, which divides the collection into 9,603 training documents and 3,299 test documents, assigned to 90 topic categories. After lemmatization, stopword filtering based on document frequency, and frequency thresholding that excluded words with frequency < 3, the training data contains 8,887 unique word types.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Representations </SectionTitle> <Paragraph position="0"> The standard BoW representations for this setup of Reuters-21578 are 8,887-dimensional and very sparse. To produce BoC representations, a k-dimensional random index vector is assigned to each training document. Context vectors for the words are then produced by adding the index vector of a document to the context vector of a given word every time the word occurs in that document.7 The context vectors are then used to generate BoC representations for the texts by summing the context vectors of the words in each text, resulting in k-dimensional dense BoC vectors.</Paragraph> <Paragraph position="1"> 7We initially also used word-based contexts, where index vectors were assigned to each unique word, and context vectors were produced by adding the random index vectors of the surrounding words to the context vector of a given word every time the word occurred in the training data. However, the word-based BoC representations consistently produced inferior results compared to the document-based ones, so we decided not to pursue the experiments with word-based BoC representations for this paper.</Paragraph> </Section>
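A minimal sketch of the BoC construction described above, assuming word context vectors such as those produced by the Random Indexing sketch earlier; the helper names and the idf weighting details are illustrative, not the authors' code.

```python
import math
import numpy as np

def boc_vector(text, context_vectors, idf=None, k=2000):
    """Bag-of-Concepts vector for one text: the (weighted) sum of the context
    vectors of the words that occur in the text.  Plain summation gives the
    implicit tf weighting; multiplying by idf gives tf x idf."""
    vec = np.zeros(k)
    for word in text.split():
        if word in context_vectors:
            weight = idf.get(word, 0.0) if idf is not None else 1.0
            vec += weight * context_vectors[word]
    return vec

def idf_weights(documents):
    """Standard idf: log(N / df) computed over the training documents."""
    N = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc.split()):
            df[word] = df.get(word, 0) + 1
    return {w: math.log(N / d) for w, d in df.items()}

# Usage, reusing G and docs from the Random Indexing sketch above:
# idf = idf_weights(docs)
# x = boc_vector("oil prices rise again", G, idf=idf, k=2000)
```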
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Support Vector Machines </SectionTitle> <Paragraph position="0"> For learning the categories, we use the Support Vector Machine (SVM) (Vapnik, 1995) algorithm for binary classification. SVM finds the separating hyperplane that has maximum margin between the two classes. Separating the examples with a maximum margin hyperplane is motivated by results from statistical learning theory, which state that a learning algorithm, to achieve good generalisation, should minimize both the empirical error and the &quot;capacity&quot; of the functions that the learning algorithm implements. By maximizing the margin, the capacity or complexity of the function class (separating hyperplanes) is minimized. Finding this hyperplane is expressed as a mathematical optimization problem.</Paragraph> <Paragraph position="1"> Let {(x_1, y_1), ..., (x_l, y_l)}, where x_i ∈ R^n and y_i ∈ {+1, -1}, be a set of training examples. The SVM separates these examples by a hyperplane defined by a weight vector w and a threshold b (see Figure 1). The weight vector w determines a direction perpendicular to the hyperplane, while b determines the distance to the hyperplane from the origin. A new example z is classified according to which side of the hyperplane it belongs to. From the solution of the optimization problem, the weight vector w has an expansion in a subset of the training examples, so classifying a new example z amounts to computing:</Paragraph> <Paragraph position="2"> $$f(z) = \mathrm{sign}\Big(\sum_{i=1}^{l} a_i y_i K(x_i, z) + b\Big) \qquad (1)$$ </Paragraph> <Paragraph position="3"> where the a_i variables are determined by the optimization procedure and K(x_i, z) is the inner product between the example vectors.</Paragraph> <Paragraph position="4"> The examples marked with grey circles in Figure 1 are called Support Vectors. These examples uniquely define the hyperplane, so if the algorithm is re-trained using only the support vectors as training examples, the same separating hyperplane is found. When examples are not linearly separable, the SVM algorithm allows the use of slack variables to permit classification errors, and makes it possible to map examples to a (high-dimensional) feature space.</Paragraph> <Paragraph position="5"> In this feature space, a separating hyperplane can be found that, when mapped back to input space, describes a non-linear decision function. The implicit mapping is performed by a kernel function that expresses the inner product between two examples in the desired feature space. This function replaces the function K(x_i, z) in Equation 1.</Paragraph> <Paragraph position="6"> [Figure 1: a maximum margin hyperplane separating a set of examples in R^2; Support Vectors are marked with circles.]</Paragraph> <Paragraph position="7"> In our experiments, we use three standard kernel functions -- the basic linear kernel, the polynomial kernel, and the radial basis kernel:8</Paragraph> <Paragraph position="8"> $$K_{lin}(x, z) = x \cdot z, \qquad K_{poly}(x, z) = (x \cdot z + 1)^d, \qquad K_{rbf}(x, z) = e^{-g\|x - z\|^2}$$ </Paragraph> <Paragraph position="9"> For all experiments, we select d = 3 for the polynomial kernel and g = 1.0 for the radial basis kernel. These parameters are selected as default values and are not optimized.</Paragraph> </Section> </Section>
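For readers who want to reproduce a comparable classifier setup, something along the following lines could be used; scikit-learn's SVC is an assumed implementation choice (the paper does not name its SVM software), with the three kernels, the stated defaults d = 3 and g = 1.0, and random stand-in data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# X: document vectors (BoW or k-dimensional BoC); Y: binary category
# indicators of shape (n_docs, n_categories).  Random stand-ins here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
Y = (rng.random((200, 3)) < 0.2).astype(int)

kernels = {
    "linear": SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=3),    # d = 3
    "radial basis": SVC(kernel="rbf", gamma=1.0),  # g = 1.0
}

for name, base in kernels.items():
    # One binary classifier per category (one-against-all).
    clf = OneVsRestClassifier(base).fit(X, Y)
    print(name, clf.predict(X[:2]))
```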
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> In these experiments, we use a one-against-all learning method, which means that we train one classifier for each category (and representation). When using the classifiers to predict the class of a test example, there are four possible outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Positive means that the document was classified as belonging to the category, negative that it was not, whereas true means that the classification was correct and false that it was not. From these four outcomes, we can define the standard evaluation metrics precision P = TP/(TP + FP) and recall R = TP/(TP + FN). These are combined into the F1 score, F1 = 2PR/(P + R), which we report micro-averaged over all categories.9</Paragraph> <Paragraph position="1"> 9Micro-averaging means that we sum the TP, TN, FP and FN over all categories and then compute the F1 score. In macro-averaging, the F1 score is computed for each category, and then averaged.</Paragraph> <Paragraph position="2"> There are a number of parameters that need to be optimized in this kind of experiment, including the weighting scheme, the kernel function, and the dimensionality of the BoC vectors. For ease of exposition, we report the results of each parameter set separately. Since we do not experiment with feature selection in this investigation, our results will be somewhat lower than other published results that use SVM with optimized feature selection. Our main focus is to compare results produced with BoW and BoC representations, and not to produce a top score for the Reuters-21578 collection.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Weighting Scheme </SectionTitle> <Paragraph position="0"> Using appropriate word weighting functions is known to improve the performance of text categorization (Yang and Pedersen, 1997). In order to investigate the impact of using different word weighting schemes for concept-based representations, we compare the performance of the SVM using the following three weighting schemes: tf, idf, and tf x idf.</Paragraph> <Paragraph position="1"> The results are summarized in Table 1. The BoW run uses the linear kernel, while the BoC runs use the polynomial kernel. The numbers in boldface are the best BoC runs for tf, idf, and tf x idf, respectively.</Paragraph> <Paragraph position="2"> [Table 1: results for tf, idf, and tf x idf weighting using BoW and BoC representations.]</Paragraph> <Paragraph position="3"> As expected, the best results for both BoW and BoC representations were produced using tf x idf. For the BoW vectors, tf consistently produced better results than idf, and it was even better than tf x idf using the polynomial and radial basis kernels. For the BoC vectors, the only consistent difference between tf and idf is found using the polynomial kernel, where idf outperforms tf.10 It is also interesting to note that for idf weighting, all BoC runs outperform BoW.</Paragraph> <Paragraph position="4"> 10For the linear and radial basis kernels, the tendency is that tf in most cases is better than idf.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Parameterizing RI </SectionTitle> <Paragraph position="0"> In theory, the quality of the context vectors produced with the Random Indexing process should increase with their dimensionality. Kaski (1999) shows that the higher the dimensionality of the vectors, the closer the matrix RR^T will approximate the identity matrix, and Bingham and Mannila (2001) observe that the mean squared difference between RR^T and the identity matrix is about 1/k, where k is the dimensionality of the vectors. In order to evaluate the effects of dimensionality in this application, we compare the performance of the SVM with BoC representations using 9 different dimensionalities of the vectors. The index vectors consist of 4 to 60 non-zero elements (≈ 1% non-zeros), depending on their dimensionality. The results for all three kernels using tf x idf weighting are displayed in Figure 2.</Paragraph>
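As an illustrative check of the 1/k behaviour cited above (not an experiment from the paper), the sketch below measures the mean squared difference between RR^T and the identity matrix for unit-length ternary index vectors; the pairings of dimensionality and non-zero counts are assumptions chosen in the spirit of the roughly 1% non-zeros setting.

```python
import numpy as np

def index_matrix(c, k, nonzeros, rng):
    """c ternary index vectors of dimensionality k, rows normalized to unit
    length so that the diagonal of R R^T is exactly 1."""
    R = np.zeros((c, k))
    for i in range(c):
        pos = rng.choice(k, size=nonzeros, replace=False)
        R[i, pos] = rng.choice([1.0, -1.0], size=nonzeros)
    return R / np.sqrt(nonzeros)

rng = np.random.default_rng(0)
c = 300                            # number of contexts (kept small here)
for k, nonzeros in [(500, 4), (1000, 10), (2500, 24), (5000, 50)]:
    R = index_matrix(c, k, nonzeros, rng)
    M = R @ R.T
    msd = np.mean((M - np.eye(c)) ** 2)
    print(f"k={k:5d}  mean squared diff(RR^T, I) = {msd:.5f}   1/k = {1/k:.5f}")
```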
<Paragraph position="1"> Figure 2 shows that performance increases with the dimensionality of the vectors, as expected, but that the increase levels out when the dimensionality becomes sufficiently large; there is hardly any difference in performance when the dimensionality of the vectors exceeds 2,500. There is even a slight tendency for performance to decrease when the dimensionality exceeds 5,000 dimensions; the best result is produced using 5,000-dimensional vectors with 50 non-zero elements in the index vectors.</Paragraph> <Paragraph position="2"> There is a decrease in performance when the dimensionality of the vectors drops below 2,000. Still, the difference in F1 score between using 500 and 5,000 dimensions with the polynomial kernel and tf x idf is only 1.04, which indicates that Random Indexing is very robust in comparison to, e.g., singular value decomposition, where choosing appropriate dimensionality is critical.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Parameterizing SVM </SectionTitle> <Paragraph position="0"> Regarding the different kernel functions, Figure 2 clearly shows that the polynomial kernel produces consistently better results for the BoC vectors than the other kernels, and that the linear kernel consistently produces better results than the radial basis kernel. This could be a demonstration of the difficulties of parameter selection, especially for the g parameter in the radial basis kernel. To further improve the results, we can find better values of g for the radial basis kernel and of d for the polynomial kernel by explicit parameter search.</Paragraph> </Section> </Section>
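One way to carry out the explicit parameter search suggested above is a cross-validated grid search over d and g; the sketch below uses scikit-learn's GridSearchCV as an assumed tool, with illustrative parameter ranges and stand-in data rather than settings from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# X: BoC document vectors, y: binary labels for one category (stand-ins here).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 500))
y = (rng.random(300) < 0.3).astype(int)

param_grid = [
    {"kernel": ["poly"], "degree": [2, 3, 4, 5]},           # search over d
    {"kernel": ["rbf"], "gamma": [0.001, 0.01, 0.1, 1.0]},  # search over g
]

search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```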
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Comparing BoW and BoC </SectionTitle> <Paragraph position="0"> If we compare the best BoW run (using the linear kernel and tf x idf weighting) and the best BoC run (using 5,000-dimensional vectors with the polynomial kernel and tf x idf weighting), we can see that the BoW representations barely outperform BoC: 82.77% versus 82.29%. However, if we only look at the results for the ten largest categories in the Reuters-21578 collection, the situation is reversed and the BoC representations outperform BoW. The F1 measure for the best BoC vectors for the ten largest categories is 88.74% compared to 88.09% for the best BoW vectors. This suggests that BoC representations are more appropriate for large-size categories.</Paragraph> <Paragraph position="1"> The best BoC representations outperform the best BoW representations in 16 categories, and are equal in 6. Of the 16 categories where the best BoC outperform the best BoW, 9 are better only in recall, 5 are better in both recall and precision, while only 2 are better only in precision.</Paragraph> <Paragraph position="2"> It is always the same set of 22 categories where the BoC representations score better than, or equal to, BoW.11 These include the two largest categories in Reuters-21578, &quot;earn&quot; and &quot;acq&quot;, consisting of 2,877 and 1,650 documents, respectively. For these two categories, BoC representations outperform BoW with 95.57% versus 95.36%, and 91.07% versus 90.16%, respectively. The smallest of the &quot;BoC categories&quot; is &quot;fuel&quot;, which consists of 13 documents, and for which BoC outperforms BoW representations with 33.33% versus 30.77%. The largest performance difference for the &quot;BoC categories&quot; is for category &quot;bop&quot;, where BoC reaches 66.67%, while BoW only reaches 54.17%. We also note that it is the same set of categories that is problematic for both types of representations; where BoW scores 0.0%, so does BoC.</Paragraph> <Paragraph position="3"> 11The &quot;BoC categories&quot; are: veg-oil, heat, gold, soybean, housing, jobs, nat-gas, cocoa, wheat, rapeseed, livestock, ship, fuel, trade, sugar, cpi, bop, lei, acq, crude, earn, money-fx.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Combining Representations </SectionTitle> <Paragraph position="0"> The above comparison suggests that we can improve the performance of the SVM by combining the two types of representation. The best F1 score can be achieved by selecting the quadruple (TP, FP, TN, FN) for each individual category from either BoW or BoC so that it maximizes the overall score. There are 2^90 such combinations, but by expressing the F1 function in its equivalent form F1 = 2TP/(2TP + FP + FN), we can determine that for our two top runs there are only 17 categories for which we need to perform an exhaustive search to find the best combination. For instance, if for one category both runs have the same TP but one of the runs has higher FP and FN, the other run is selected for that category and we do not include that category in the exhaustive search.</Paragraph> <Paragraph position="1"> Combining the best BoW and BoC runs increases the results from 82.77% (the best BoW run) to 83.91%. For the top ten categories, this increases the score from 88.74% (the best BoC run) to 88.99%. Even though the difference is admittedly small, the increase in performance when combining representations is not negligible, and is consistent with the findings of previous research (Cai and Hofmann, 2003).</Paragraph> </Section> </Paper>
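To make the combination procedure concrete, the following sketch (an illustration of the idea, not the authors' code) fixes each category where one run dominates under the equivalent form F1 = 2TP/(2TP + FP + FN) and searches exhaustively over the remaining categories; the per-category counts in the usage example are hypothetical.

```python
from itertools import product

def micro_f1(quads):
    """quads: list of (TP, FP, FN) per category; returns micro-averaged F1."""
    tp = sum(q[0] for q in quads)
    fp = sum(q[1] for q in quads)
    fn = sum(q[2] for q in quads)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def combine(bow, boc):
    """bow, boc: per-category (TP, FP, FN) quadruples from the two runs.
    Micro F1 is increasing in TP and decreasing in FP + FN, so a run can be
    fixed for a category when it has >= TP and <= FP + FN; the remaining,
    non-dominated categories are searched exhaustively."""
    fixed, free = [], []
    for a, b in zip(bow, boc):
        if a[0] >= b[0] and a[1] + a[2] <= b[1] + b[2]:
            fixed.append(a)
        elif b[0] >= a[0] and b[1] + b[2] <= a[1] + a[2]:
            fixed.append(b)
        else:
            free.append((a, b))
    best_f1 = -1.0
    for choice in product(*free):          # 2^len(free) combinations
        best_f1 = max(best_f1, micro_f1(fixed + list(choice)))
    return best_f1

# Toy usage with three categories and hypothetical counts:
bow = [(90, 10, 12), (40, 30, 5), (5, 1, 9)]
boc = [(92, 11, 10), (38, 20, 8), (5, 3, 9)]
print(round(combine(bow, boc), 4))
```

The pruning step matters because micro-averaged F1 is not separable across categories, so a purely greedy per-category choice is not guaranteed to be optimal for the non-dominated cases.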