<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3024"> <Title>A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We compare KL and dKL to mutual information, using two standard data sets: 20 Newsgroups and Reuters 21578. In tokenizing the data, only words consisting of alphabetic characters are kept, after conversion to lower case. In addition, all numbers are mapped to a special token NUM. For 20 Newsgroups we remove the newsgroup headers and use a stoplist consisting of the 100 most frequent words of the British National Corpus.</Paragraph> <Paragraph position="1"> We use the ModApte split of Reuters 21578 (Apté et al., 1994) and use only the 10 largest classes. The vocabulary size is 111868 words for 20 Newsgroups and 22430 words for Reuters.</Paragraph> <Paragraph position="2"> Experiments with 20 Newsgroups are performed with 5-fold cross-validation, using 80% of the data for training and 20% for testing. We build a single classifier for the 20 classes and vary the number of selected words from 20 to 20000. Figure 1 compares classification accuracy for the three scoring functions; the curves have small error bars. dKL slightly outperforms mutual information, especially for smaller vocabulary sizes. The difference is statistically significant for 20 to 200 words at the 99% confidence level, and for 20 to 2000 words at the 95% confidence level, using a one-tailed paired t-test.</Paragraph> <Paragraph position="3"> For the Reuters data set we build a binary classifier for each of the ten topics and set the number of positively classified documents such that precision equals recall. Precision is the percentage of positive documents among all positively classified documents. 
Recall is the percentage of positive documents that are classified as positive.</Paragraph> <Paragraph position="4"> In Figures 2 and 3 we report microaveraged and macroaveraged recall for each number of selected words. Microaveraged recall is the percentage of all positive documents (in all topics) that are classified as positive. Macroaveraged recall is the average of the recall values of the individual topics. Microaveraged recall gives equal weight to the documents and thus emphasizes larger topics, while macroaveraged recall gives equal weight to the topics and thus emphasizes smaller topics more than microaveraged recall.</Paragraph> <Paragraph position="5"> Both KL and dKL achieve slightly higher values for microaveraged recall than mutual information, for most vocabulary sizes (Fig. 2). KL performs best at 20000 words with 90.1% microaveraged recall, compared to 89.3% for mutual information. The largest improvement is found for dKL at 100 words with 88.0%, compared to 86.5% for mutual information. For smaller categories, the difference between the KL-divergence based scores and mutual information is larger, as indicated by the curves for macroaveraged recall (Fig. 3). KL yields the highest recall at 20000 words with 82.2%, an increase of 3.9% compared to mutual information with 78.3%, whereas dKL has its largest value at 100 words with 78.8%, compared to 76.1% for mutual information.</Paragraph> <Paragraph position="6"> We find the largest improvement at 5000 words with 5.6% for KL and 2.9% for dKL, compared to mutual information.</Paragraph> </Section> </Paper>
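
The evaluation protocol described above (per-topic precision=recall break-even, then microaveraged and macroaveraged recall) can be sketched as follows. This is a minimal illustration with hypothetical scores and labels; the paper's multinomial naive Bayes classifier and feature-selection scores are not reproduced here, only the metric computation.

```python
def breakeven_predictions(scores, labels):
    """Classify the top-k scored documents as positive, where k is the
    number of truly positive documents.  With this choice the number of
    predicted positives equals the number of actual positives, so
    precision equals recall by construction (the break-even point)."""
    k = sum(labels)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    preds = [0] * len(scores)
    for i in ranked[:k]:
        preds[i] = 1
    return preds

def micro_macro_recall(per_topic):
    """per_topic: list of (labels, preds) pairs, one pair per topic.
    Microaveraged recall pools true positives over all topics (weights
    documents, so large topics dominate); macroaveraged recall averages
    the per-topic recall values (weights topics equally)."""
    tp_total = pos_total = 0
    recalls = []
    for labels, preds in per_topic:
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        pos = sum(labels)
        tp_total += tp
        pos_total += pos
        recalls.append(tp / pos)
    micro = tp_total / pos_total
    macro = sum(recalls) / len(recalls)
    return micro, macro
```

With hypothetical classifier scores, `breakeven_predictions([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])` marks the two top-scored documents positive, giving precision = recall = 0.5 for that topic; feeding such per-topic results into `micro_macro_recall` yields the two averages reported in Figures 2 and 3.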