File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-3024_intro.xml
Size: 3,071 bytes
Last Modified: 2025-10-06 14:02:28
<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3024">
<Title>A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Text classification is the assignment of predefined categories to text documents. It has many applications in natural language processing tasks such as e-mail filtering, prediction of user preferences and organization of web content.</Paragraph>
<Paragraph position="1"> The Naive Bayes classifier is a popular machine learning technique for text classification because, despite its simplicity, it performs well in many domains (Domingos and Pazzani, 1997). Naive Bayes assumes a stochastic model of document generation.</Paragraph>
<Paragraph position="2"> Using Bayes' rule, the model is inverted in order to predict the most likely class for a new document.</Paragraph>
<Paragraph position="3"> We assume that documents are generated according to a multinomial event model (McCallum and Nigam, 1998). Thus a document is represented as a vector $d_i = (x_{i1}, \ldots, x_{i|V|})$ of word counts, where $V$ is the vocabulary and each $x_{it} \in \{0, 1, 2, \ldots\}$ indicates how often $w_t$ occurs in $d_i$. Given model parameters $p(w_t \mid c_j)$ and class prior probabilities $p(c_j)$, and assuming independence of the words, the most likely class for a document $d_i$ is computed as</Paragraph>
<Paragraph position="4"> $$c^*(d_i) = \arg\max_j \; p(c_j) \prod_t p(w_t \mid c_j)^{n(w_t, d_i)}$$ </Paragraph>
<Paragraph position="5"> where $n(w_t, d_i)$ is the number of occurrences of $w_t$ in $d_i$. $p(w_t \mid c_j)$ and $p(c_j)$ are estimated from training documents with known classes, using maximum likelihood estimation with a Laplacean prior:</Paragraph>
<Paragraph position="6"> $$p(w_t \mid c_j) = \frac{1 + \sum_{d_i \in c_j} n(w_t, d_i)}{|V| + \sum_s \sum_{d_i \in c_j} n(w_s, d_i)}, \qquad p(c_j) = \frac{|\{d_i : d_i \in c_j\}|}{|D|}$$ </Paragraph>
<Paragraph position="7"> It is common practice to use only a subset of the words in the training documents for classification, both to avoid overfitting and to make classification more efficient. This is usually done by assigning each word a score $f(w_t)$ that measures its usefulness for classification and selecting the $N$ highest-scored words. One of the best performing scoring functions for feature selection in text classification is mutual information (Yang and Pedersen, 1997).</Paragraph>
<Paragraph position="8"> The mutual information between two random variables, $MI(X; Y)$, measures the amount of information that the value of one variable gives about the value of the other (Cover and Thomas, 1991).</Paragraph>
<Paragraph position="9"> Note that in the multinomial model, the word variable $W$ takes on values from the vocabulary $V$.</Paragraph>
<Paragraph position="10"> In order to use mutual information with a multinomial model, one defines new random variables $W_t \in \{0, 1\}$ with $p(W_t = 1) = p(W = w_t)$ (McCallum and Nigam, 1998; Rennie, 2001). Then the mutual information between a word $w_t$ and the class is</Paragraph>
<Paragraph position="11"> $$MI(W_t; C) = \sum_{x \in \{0,1\}} \sum_j p(x, c_j) \log \frac{p(x, c_j)}{p(x)\, p(c_j)}$$ </Paragraph>
<Paragraph position="12"> where $p(x, c_j)$ and $p(x)$ are short for $p(W_t = x, c_j)$ and $p(W_t = x)$. $p(x, c_j)$, $p(x)$ and $p(c_j)$ are estimated from the training documents by counting how often $w_t$ occurs in each class.</Paragraph>
</Section>
</Paper>
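
The following is a minimal sketch, not code from the paper, of the multinomial Naive Bayes training and classification steps described above. It assumes documents are given as word-count dictionaries and classes as string labels (names such as train_multinomial_nb and classify are illustrative only).

import math
from collections import defaultdict

def train_multinomial_nb(docs, labels):
    """Estimate p(c_j) and p(w_t|c_j) with a Laplacean (add-one) prior."""
    vocab = set()
    class_doc_count = defaultdict(int)                          # documents per class
    class_word_count = defaultdict(lambda: defaultdict(int))    # counts of w_t summed over docs in c_j
    for doc, c in zip(docs, labels):
        class_doc_count[c] += 1
        for w, n in doc.items():
            vocab.add(w)
            class_word_count[c][w] += n
    prior = {c: class_doc_count[c] / len(docs) for c in class_doc_count}
    cond = {}
    for c in class_doc_count:
        total = sum(class_word_count[c].values())
        # p(w_t|c_j) = (1 + count of w_t in c_j) / (|V| + total word count in c_j)
        cond[c] = {w: (1 + class_word_count[c][w]) / (len(vocab) + total) for w in vocab}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Return argmax_j of log p(c_j) + sum_t n(w_t, d) * log p(w_t|c_j)."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w, n in doc.items():
            if w in vocab:
                score += n * math.log(cond[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

Working in log space avoids underflow from multiplying many small probabilities; the argmax is unchanged because the logarithm is monotonic.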
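
Below is an assumed sketch, not the paper's implementation, of the mutual information score under one plausible reading of the token-level estimation described above: probabilities are estimated from how often each word occurs in each class, relative to the total number of word occurrences. The helper names mutual_information_scores and select_features are hypothetical.

import math
from collections import defaultdict

def mutual_information_scores(docs, labels):
    """Return {w_t: MI(W_t; C)} given word-count dicts and class labels."""
    word_class = defaultdict(lambda: defaultdict(int))   # occurrences of w_t in class c_j
    class_total = defaultdict(int)                       # total word occurrences in class c_j
    for doc, c in zip(docs, labels):
        for w, n in doc.items():
            word_class[w][c] += n
            class_total[c] += n
    total = sum(class_total.values())
    p_c = {c: class_total[c] / total for c in class_total}
    scores = {}
    for w, per_class in word_class.items():
        p_w1 = sum(per_class.values()) / total           # p(W_t = 1)
        mi = 0.0
        for c in class_total:
            p_1c = per_class.get(c, 0) / total           # p(W_t = 1, c_j)
            p_0c = p_c[c] - p_1c                         # p(W_t = 0, c_j)
            for p_xc, p_x in ((p_1c, p_w1), (p_0c, 1.0 - p_w1)):
                if p_xc > 0 and p_x > 0:
                    mi += p_xc * math.log(p_xc / (p_x * p_c[c]))
        scores[w] = mi
    return scores

def select_features(scores, N):
    """Keep the N highest-scoring words, as in the feature selection scheme above."""
    return set(sorted(scores, key=scores.get, reverse=True)[:N])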