<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3024"> <Title>A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Naive Bayes and KL-Divergence </SectionTitle> <Paragraph position="0"> There is a strong connection between Naive Bayes and KL-divergence (Kullback-Leibler divergence, relative entropy). KL-divergence measures how much one probability distribution differs from another (Cover and Thomas, 1991). For discrete distributions it is defined by</Paragraph> <Paragraph position="1"> KL(p \| q) = \sum_t p(w_t) \log \frac{p(w_t)}{q(w_t)} </Paragraph> <Paragraph position="2"> By viewing a document as a probability distribution over words, Naive Bayes can be interpreted in an information-theoretic framework (Dhillon et al., 2002). Let p(w_t|d) = n(w_t, d)/|d| be the relative frequency of word w_t in document d. Taking logarithms and dividing by the length of d, (1) can be rewritten as</Paragraph> <Paragraph position="3"> c^*(d) = \arg\max_{c_j} \Big[ \frac{1}{|d|} \log p(c_j) - KL\big( p(\cdot|d) \,\|\, p(\cdot|c_j) \big) \Big] </Paragraph> <Paragraph position="4"> This means that Naive Bayes assigns to a document d the class which is &quot;most similar&quot; to d in terms of the distribution of words. Note also that the prior probabilities are usually dominated by the document probabilities, except for very short documents.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Feature Selection using KL-Divergence </SectionTitle> <Paragraph position="0"> We define a new scoring function for feature selection based on the following considerations. In the previous section we saw that Naive Bayes assigns a document d the class c such that the &quot;distance&quot; between d and c is minimized. 
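The information-theoretic reading of Naive Bayes above can be sketched in a few lines of Python. This is an illustrative toy only (all names are hypothetical, and the smoothed class word distributions are assumed to be given as plain dicts); it classifies a document by maximizing the length-normalized log prior minus the KL-divergence between the document's and the class's word distributions:

```python
import math

def kl(p, q):
    # KL(p || q) for word distributions given as dicts mapping word to probability;
    # q must be positive wherever p is (assumed smoothed upstream)
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def doc_dist(doc, vocab, eps=1e-9):
    # p(w_t|d) = n(w_t, d) / |d|, with a tiny epsilon so every vocabulary
    # word receives nonzero mass
    n = len(doc)
    return {w: (doc.count(w) + eps) / (n + eps * len(vocab)) for w in vocab}

def classify(doc, class_dists, priors, vocab):
    # Naive Bayes as KL minimization:
    #   argmax_c  (1/|d|) log p(c)  -  KL( p(.|d) || p(.|c) )
    p_d = doc_dist(doc, vocab)
    return max(class_dists,
               key=lambda c: math.log(priors[c]) / len(doc) - kl(p_d, class_dists[c]))
```

With two toy classes whose distributions favor word "a" and word "b" respectively, a document dominated by "a" is assigned the "a"-heavy class, i.e. the class most similar to it in word distribution.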
A classification error occurs when a test document is closer, in terms of KL-divergence, to some other class than to its true class.</Paragraph> <Paragraph position="1"> We seek to define a scoring function such that words whose distribution in the individual training documents of a class differs strongly from their distribution in the class as a whole (according to (2)) receive a lower score, while words distributed similarly across all training documents of the same class receive a higher score. By removing low-scoring words from the vocabulary, the training documents of each class become more similar to each other, and therefore also to the class, in terms of word distribution. This leads to more homogeneous classes.</Paragraph> <Paragraph position="2"> Assuming that the test documents and training documents come from the same distribution, the similarity between the test documents and their respective classes increases as well, resulting in higher classification accuracy.</Paragraph> <Paragraph position="3"> We now make this more precise. Let S = {d_1, ..., d_{|S|}} be the set of training documents, and denote the class of d_i by c(d_i). The average KL-divergence for a word w_t between the training documents and their classes is given by</Paragraph> <Paragraph position="4"> KL_t(S) = \frac{1}{|S|} \sum_{i=1}^{|S|} p(w_t|d_i) \log \frac{p(w_t|d_i)}{p(w_t|c(d_i))} </Paragraph> <Paragraph position="5"> One problem with (8) is that, in addition to the conditional probabilities p(w_t|c_j) for each word and each class, the computation considers each individual document, resulting in a time requirement of O(|S|).^1 To avoid this additional complexity, instead of KL_t(S) we use an approximation \tilde{KL}_t(S), based on the following two assumptions: (i) the number of occurrences of w_t is the same in all documents that contain w_t, and (ii) all documents in the same class c_j have the same length. 
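The exact quantity being approximated can be computed directly from the verbal definition (average, over all training documents, of the per-word divergence term between a document and its class). A minimal sketch, with hypothetical names, illustrating why this costs O(|S|) per word:

```python
import math

def klt_exact(docs, labels, word, class_dist):
    # Exact KL_t(S), per the verbal definition: the average over all training
    # documents of  p(w_t|d_i) * log( p(w_t|d_i) / p(w_t|c(d_i)) ).
    # docs: list of token lists; labels: class of each document;
    # class_dist[c][word]: smoothed p(w_t|c_j).
    # Touches every document, hence O(|S|) per word.
    total = 0.0
    for doc, c in zip(docs, labels):
        p_wd = doc.count(word) / len(doc)
        if p_wd > 0:
            total += p_wd * math.log(p_wd / class_dist[c][word])
    return total / len(docs)
```

Note that a second pass is unavoidable here: p(w_t|c_j) must already be known when each document is visited.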
Let N_{jt} be the number of documents in c_j that contain w_t, and let</Paragraph> <Paragraph position="6"> \hat{p}(w_t|c_j) = \frac{1}{N_{jt}} \sum_{d_i \in c_j,\, w_t \in d_i} p(w_t|d_i) </Paragraph> <Paragraph position="7"> be the average probability of w_t in those documents in c_j that contain w_t (if w_t does not occur in c_j, set</Paragraph> <Paragraph position="8"> \hat{p}(w_t|c_j) = 0). </Paragraph> <Paragraph position="9"> Plugging in (9) and (3) and defining q(w_t|c_j) = N_{jt}/|c_j|, we get</Paragraph> <Paragraph position="10"> \tilde{KL}_t(S) = \sum_j \frac{|c_j|}{|S|}\, q(w_t|c_j)\, \hat{p}(w_t|c_j) \log \frac{\hat{p}(w_t|c_j)}{p(w_t|c_j)} </Paragraph> <Paragraph position="11"> Note that computing \tilde{KL}_t(S) requires only word and document counts per class, not per document.^1 </Paragraph> <Paragraph position="12"> Thus \tilde{KL}_t(S) can be computed in O(|C|). Typically, |C| is much smaller than |S|. (^1 Note that KL_t(S) cannot be computed simultaneously with p(w_t|c_j) in one pass over the documents in (2): KL_t(S) requires p(w_t|c_j) when each document is considered, but computing the latter itself requires iterating over all documents.)</Paragraph> <Paragraph position="13"> Another important point is the following.</Paragraph> <Paragraph position="14"> By removing words with an uneven distribution in the documents of the same class, not only the documents in the class but also the classes themselves may become more similar, which reduces the ability to distinguish between different classes. Let p(w_t) be the number of occurrences of w_t in all training documents, divided by the total number of words,</Paragraph> <Paragraph position="15"> p(w_t) = \frac{\sum_j n(w_t, c_j)}{\sum_s \sum_j n(w_s, c_j)}, \qquad \tilde{K}_t(S) = \sum_j \frac{|c_j|}{|S|}\, q(w_t|c_j)\, \hat{p}(w_t|c_j) \log \frac{\hat{p}(w_t|c_j)}{p(w_t)} </Paragraph> <Paragraph position="16"> \tilde{K}_t(S) can be interpreted as an approximation of the average divergence of the distribution of w_t in the individual training documents from the global distribution (averaged over all training documents in all classes). If w_t is independent of the class, then \tilde{K}_t(S) = \tilde{KL}_t(S). The difference between the two is a measure of the increase in homogeneity of the training documents, in terms of the distribution of w_t, when the documents are clustered into their true classes. It is large if the distribution of w_t is similar in the training documents of the same class but dissimilar in documents of different classes. 
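The class-level computation can be sketched as follows. This is a reconstruction from the verbal definitions under assumptions (i) and (ii), not the paper's exact equations, and all names are hypothetical: for each class we need only the number of documents containing w_t, the occurrence count of w_t, the class's document count, and its total word count, giving the O(|C|) cost claimed above:

```python
import math

def kl_score(word_stats, class_doc_counts, class_word_totals):
    # word_stats[c] = (N_ct, n_ct): documents of class c containing w_t,
    # and total occurrences of w_t in class c.
    S = sum(class_doc_counts.values())            # |S|, total training documents
    total_words = sum(class_word_totals.values())
    total_wt = sum(n for _, n in word_stats.values())
    p_wt = total_wt / total_words                 # global p(w_t)
    e_k, f_kl = 0.0, 0.0
    for c, (N_ct, n_ct) in word_stats.items():
        if N_ct == 0:
            continue
        q = N_ct / class_doc_counts[c]            # q(w_t|c) = N_jt / |c_j|
        # average p(w_t|d) over docs of c containing w_t, under (i)-(ii):
        p_hat = n_ct * class_doc_counts[c] / (N_ct * class_word_totals[c])
        w = class_doc_counts[c] / S               # class weight |c_j| / |S|
        # divergence from the class distribution p(w_t|c) = q * p_hat:
        f_kl += w * q * p_hat * math.log(1.0 / q)
        # divergence from the global distribution p(w_t):
        e_k += w * q * p_hat * math.log(p_hat / p_wt)
    return e_k - f_kl                             # difference of the two quantities
```

If a word's counts are identical in every class, the two quantities coincide and the difference is zero, matching the observation that class-independent words contribute nothing.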
In analogy to mutual information, we define our new scoring function as the difference</Paragraph> <Paragraph position="17"> KL = \tilde{K}_t(S) - \tilde{KL}_t(S) </Paragraph> <Paragraph position="18"> We also use a variant of KL, denoted \widehat{KL}, where p(w_t) is estimated according to (14):</Paragraph> <Paragraph position="19"> \widehat{KL}_t(S) = \sum_j \frac{|c_j|}{|S|}\, q(w_t|c_j)\, \hat{p}(w_t|c_j) \log \frac{p(w_t|c_j)}{p(w_t)} </Paragraph> <Paragraph position="20"> and p(w_t|c_j) is estimated as in (2).</Paragraph> </Section> </Paper>