<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3024">
  <Title>A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Text classification is the assignment of predefined categories to text documents. Text classification has many applications in natural language processing tasks such as E-mail filtering, prediction of user preferences and organization of web content.</Paragraph>
    <Paragraph position="1"> The Naive Bayes classifier is a popular machine learning technique for text classification because it performs well in many domains, despite its simplicity (Domingos and Pazzani, 1997). Naive Bayes assumes a stochastic model of document generation.</Paragraph>
    <Paragraph position="2"> Using Bayes' rule, the model is inverted in order to predict the most likely class for a new document.</Paragraph>
    <Paragraph position="3"> We assume that documents are generated according to a multinomial event model (McCallum and Nigam, 1998). Thus a document is represented as a vector di = (xi1 ::: xijV j) of word counts where V is the vocabulary and each xit 2 f0; 1; 2;::: g indicates how often wt occurs in di. Given model parameters p(wtjcj) and class prior probabilities p(cj) and assuming independence of the words, the most likely class for a document di is computed as</Paragraph>
    <Paragraph position="5"> where n(wt;di) is the number of occurrences of wt in di. p(wtjcj) and p(cj) are estimated from training documents with known classes, using maximum likelihood estimation with a Laplacean prior:</Paragraph>
    <Paragraph position="7"> It is common practice to use only a subset of the words in the training documents for classification to avoid overfitting and make classification more efficient. This is usually done by assigning each word a score f(wt) that measures its usefulness for classification and selecting the N highest scored words. One of the best performing scoring functions for feature selection in text classification is mutual information (Yang and Pedersen, 1997).</Paragraph>
    <Paragraph position="8"> The mutual information between two random variables, MI(X; Y ), measures the amount of information that the value of one variable gives about the value of the other (Cover and Thomas, 1991).</Paragraph>
    <Paragraph position="9"> Note that in the multinomial model, the word variable W takes on values from the vocabulary V .</Paragraph>
    <Paragraph position="10"> In order to use mutual information with a multinomial model, one defines new random variables Wt 2 f0; 1g with p(Wt = 1) = p(W = wt) (Mc-Callum and Nigam, 1998; Rennie, 2001). Then the mutual information between a word wt and the class</Paragraph>
    <Paragraph position="12"> where p(x;cj) and p(x) are short for p(Wt = x;cj) and p(Wt = x). p(x;cj), p(x) and p(cj) are estimated from the training documents by counting how often wt occurs in each class.</Paragraph>
  </Section>
class="xml-element"></Paper>