<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4002">
<Title>MMR-based feature selection for text categorization</Title>
<Section position="5" start_page="21" end_page="21" type="evalu">
<SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> To compare the performance of the MMR-based feature selection method with conventional IG and the greedy feature selection method (Koller & Sahami's method, labeled 'Greedy'), we evaluated the three feature selection methods with four different learning algorithms: naive Bayes, TFIDF/Rocchio, Probabilistic Indexing (PrTFIDF) [7], and Maximum Entropy, using Rainbow [6].</Paragraph>
<Paragraph position="1"> We also compared the performance of conventional machine learning algorithms using our feature selection method with that of SVM using all features.</Paragraph>
<Paragraph position="2"> MMR-based feature selection and the greedy feature selection method (Koller & Sahami's method) require quadratic time with respect to the number of features. To reduce this complexity, for each data set we first selected 1000 features using IG and then applied MMR-based feature selection and the greedy method to these 1000 features.</Paragraph>
<Paragraph position="3"> We did not remove stopwords from any dataset. The results reported on all datasets are averaged over 10 different test/training splits. A random subset of 20% of the data considered in an experiment was used for testing (i.e., we used Rainbow's '--test-set=0.2' and '--test=10' options), because Rainbow does not support 10-fold cross-validation.</Paragraph>
<Paragraph position="4"> The MMR-based feature selection method requires tuning of the parameter λ; a tuning method based on held-out data appears to be needed here. We tested our method with 11 λ values (0, 0.1, 0.2, ..., 1) and selected the best λ value.</Paragraph>
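The two-stage procedure above (IG pre-filtering to 1000 terms, followed by MMR-style greedy selection with a redundancy penalty weighted by λ) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name mmr_select, the use of scikit-learn's mutual information estimate as the IG score, and cosine similarity between term-occurrence columns as the redundancy measure are all assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation) of the two-stage selection:
# IG pre-filtering to 1000 terms, then MMR-style greedy selection.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics.pairwise import cosine_similarity


def mmr_select(X, y, n_prefilter=1000, n_select=200, lam=0.5):
    """X: document-term count matrix (n_docs x n_terms), y: class labels.
    Returns the indices of the selected terms in the original vocabulary."""
    # Stage 1: keep the n_prefilter terms with the highest information gain
    # (approximated here by mutual information with the class label), so
    # that the quadratic MMR stage remains tractable.
    ig = mutual_info_classif(X, y, discrete_features=True)
    candidates = np.argsort(ig)[::-1][:n_prefilter]

    # Pairwise term-term similarity over the pre-filtered vocabulary;
    # cosine similarity of term-occurrence columns is an assumption here.
    sim = cosine_similarity(X[:, candidates].T)

    selected = [0]  # candidates are sorted by IG, so index 0 is the top-IG term
    while len(selected) < min(n_select, n_prefilter):
        remaining = [i for i in range(len(candidates)) if i not in selected]
        # MMR score: lam * relevance - (1 - lam) * redundancy, where the
        # redundancy of a term is its maximum similarity to any term
        # selected so far. lam = 1 recovers plain IG ranking.
        scores = [lam * ig[candidates[i]] - (1 - lam) * sim[i, selected].max()
                  for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return candidates[selected]
```

Sweeping lam over {0, 0.1, ..., 1} on held-out data and keeping the best-performing value mirrors the λ tuning described above; lam = 1 ignores redundancy entirely, while lam = 0 ignores relevance.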
<Section position="1" start_page="21" end_page="21" type="sub_section">
<SectionTitle> 4.1 Reuters-21578 </SectionTitle>
<Paragraph position="0"> The Reuters-21578 corpus contains 21578 articles taken from the Reuters newswire. Each article is typically assigned to one or more semantic categories such as 'earn', 'trade', and 'corn'; the total number of categories is 114.</Paragraph>
<Paragraph position="1"> Following [3], we constructed a subset of the Reuters corpus comprising the articles on the topics 'coffee', 'iron-steel', and 'livestock'.</Paragraph>
</Section>
<Section position="2" start_page="21" end_page="21" type="sub_section">
<SectionTitle> 4.2 WebKB </SectionTitle>
<Paragraph position="0"> This data set contains WWW pages collected from the computer science departments of various universities in January 1997 by the World Wide Knowledge Base (WebKB) project of the CMU text learning group. The 8282 pages were manually classified into 7 categories: 'course', 'department', 'faculty', 'project', 'staff', 'student', and 'other'. Following [1], we discarded the categories 'other', 'department', and 'staff'. The remaining part of the corpus contains 4199 documents in four categories.</Paragraph>
</Section>
<Section position="3" start_page="21" end_page="21" type="sub_section">
<SectionTitle> 4.3 Experimental Results </SectionTitle>
<Paragraph position="0"> Figure 1 displays the performance curves for the four machine learning algorithms on the Reuters subset after term selection using MMR-based feature selection (number of features is 25). When the parameter λ = 0.5, most of the machine learning algorithms achieve their best performance, with significant improvements over conventional information gain (i.e., λ = 1) and over SVM using all features. (Figure 1. MMR feature selection for four machine learning algorithms on Reuters; #features = 25.)</Paragraph>
<Paragraph position="1"> Table 1 shows the performance of the four machine learning algorithms on WebKB using the three feature selection methods and all features (41763 terms). On this data set, MMR-based feature selection again gives the best performance, with significant improvements over the greedy method and IG. Using MMR-based feature selection, for example, the vocabulary is reduced from 41763 terms to 200 (a 99.5% reduction) and the accuracy of naive Bayes improves from 85.26% to 90.49%; using the greedy method and IG, the naive Bayes accuracy improves from 85.26% to only about 87%. PrTFIDF is the most sensitive to the feature selection method: with MMR-based feature selection its best accuracy is 82.47%, whereas with the greedy method and IG the best accuracy is only 72-74%. On this data set, however, MMR-based feature selection does not allow the conventional machine learning algorithms to improve over SVM.</Paragraph>
<Paragraph position="2"> The observations on Reuters and WebKB are highly consistent. MMR-based feature selection is consistently more effective than the greedy method and IG on both data sets, and it sometimes produces improvements even over the best SVM.</Paragraph>
</Section>
</Section>
</Paper>