A New Feature Selection Score for Multinomial Naive
Bayes Text Classification Based on KL-Divergence
Karl-Michael Schneider
Department of General Linguistics
University of Passau
94032 Passau, Germany
schneide@phil.uni-passau.de
Abstract
We define a new feature selection score for text
classification based on the KL-divergence between
the distribution of words in training documents and
their classes. The score favors words that have a
similar distribution in documents of the same class
but different distributions in documents of different
classes. Experiments on two standard data sets in-
dicate that the new method outperforms mutual in-
formation, especially for smaller categories.
1 Introduction
Text classification is the assignment of predefined
categories to text documents. Text classification
has many applications in natural language process-
ing tasks such as E-mail filtering, prediction of user
preferences and organization of web content.
The Naive Bayes classifier is a popular machine
learning technique for text classification because it
performs well in many domains, despite its simplic-
ity (Domingos and Pazzani, 1997). Naive Bayes as-
sumes a stochastic model of document generation.
Using Bayes’ rule, the model is inverted in order to
predict the most likely class for a new document.
We assume that documents are generated accord-
ing to a multinomial event model (McCallum and
Nigam, 1998). Thus a document is represented as
a vector $d_i = (x_{i1}, \ldots, x_{i|V|})$ of word counts, where
$V$ is the vocabulary and each $x_{it} \in \{0, 1, 2, \ldots\}$
indicates how often $w_t$ occurs in $d_i$. Given model
parameters $p(w_t|c_j)$ and class prior probabilities $p(c_j)$,
and assuming independence of the words, the most
likely class for a document $d_i$ is computed as
$$c^*(d_i) = \operatorname*{argmax}_j p(c_j)\, p(d_i|c_j)
= \operatorname*{argmax}_j p(c_j) \prod_{t=1}^{|V|} p(w_t|c_j)^{n(w_t, d_i)} \quad (1)$$
where $n(w_t, d_i)$ is the number of occurrences of $w_t$
in $d_i$. $p(w_t|c_j)$ and $p(c_j)$ are estimated from training
documents with known classes, using maximum
likelihood estimation with a Laplacean prior:
$$p(w_t|c_j) = \frac{1 + \sum_{d_i \in c_j} n(w_t, d_i)}{|V| + \sum_{t'=1}^{|V|} \sum_{d_i \in c_j} n(w_{t'}, d_i)} \quad (2)$$

$$p(c_j) = \frac{|c_j|}{\sum_{j'=1}^{|C|} |c_{j'}|} \quad (3)$$
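As a concrete illustration of (1)-(3), the following minimal Python sketch estimates the smoothed parameters and classifies a document in log space. The function names and the tokenized-list input format are illustrative choices, not from the paper:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate p(w_t|c_j) with a Laplacean prior (Eq. 2) and class
    priors p(c_j) (Eq. 3) from tokenized training documents."""
    vocab = {w for d in docs for w in d}
    classes = set(labels)
    counts = {c: Counter() for c in classes}   # word counts per class
    for d, c in zip(docs, labels):
        counts[c].update(d)
    cond = {}                                  # cond[(w, c)] = p(w|c)
    for c in classes:
        total = sum(counts[c].values())
        for w in vocab:
            cond[(w, c)] = (1 + counts[c][w]) / (len(vocab) + total)
    n_docs = Counter(labels)
    prior = {c: n_docs[c] / len(docs) for c in classes}
    return vocab, prior, cond

def classify_nb(doc, vocab, prior, cond):
    """Return the most likely class per Eq. (1), computed in log
    space to avoid floating-point underflow on long documents."""
    def log_post(c):
        return math.log(prior[c]) + sum(
            math.log(cond[(w, c)]) for w in doc if w in vocab)
    return max(prior, key=log_post)
```

Working in log space is the standard trick here: the product in (1) underflows quickly, while the equivalent sum of logarithms is numerically stable.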
It is common practice to use only a subset of
the words in the training documents for classifi-
cation to avoid overfitting and make classification
more efficient. This is usually done by assigning
each word a score $f(w_t)$ that measures its usefulness
for classification and selecting the $N$ highest-scoring
words. One of the best performing scoring
functions for feature selection in text classification
is mutual information (Yang and Pedersen, 1997).
The mutual information between two random
variables, $MI(X; Y)$, measures the amount of information
that the value of one variable gives about the
value of the other (Cover and Thomas, 1991).
Note that in the multinomial model, the word
variable $W$ takes on values from the vocabulary $V$.
In order to use mutual information with a multinomial
model, one defines new random variables
$W_t \in \{0, 1\}$ with $p(W_t = 1) = p(W = w_t)$ (McCallum
and Nigam, 1998; Rennie, 2001). Then the
mutual information between a word $w_t$ and the class
variable $C$ is
$$MI(W_t; C) = \sum_{j=1}^{|C|} \sum_{x=0,1} p(x, c_j) \log \frac{p(x, c_j)}{p(x)\, p(c_j)} \quad (4)$$
where $p(x, c_j)$ and $p(x)$ are short for $p(W_t = x, c_j)$
and $p(W_t = x)$. $p(x, c_j)$, $p(x)$ and $p(c_j)$ are estimated
from the training documents by counting how
often $w_t$ occurs in each class.
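The score in (4) can be computed from per-class word-occurrence counts. The sketch below assumes the multinomial estimates just described (joint and marginal probabilities as fractions of all word occurrences); the function name and the parallel-list input layout are illustrative assumptions:

```python
import math

def mi_score(word_counts, class_totals):
    """Mutual information MI(W_t; C) per Eq. (4).
    word_counts[j]  = occurrences of w_t in class c_j,
    class_totals[j] = total word occurrences in c_j.
    Probabilities follow the multinomial event model,
    i.e. p(W_t = 1) = p(W = w_t)."""
    N = sum(class_totals)
    n_t = sum(word_counts)
    mi = 0.0
    for n_jt, n_j in zip(word_counts, class_totals):
        p_cj = n_j / N
        for x in (0, 1):
            p_x_cj = (n_jt if x == 1 else n_j - n_jt) / N  # joint p(x, c_j)
            p_x = (n_t if x == 1 else N - n_t) / N          # marginal p(x)
            if p_x_cj > 0:                                  # 0 log 0 = 0
                mi += p_x_cj * math.log(p_x_cj / (p_x * p_cj))
    return mi
```

A word distributed proportionally across classes gets a score of zero; a word concentrated in one class gets a positive score.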
2 Naive Bayes and KL-Divergence
There is a strong connection between Naive Bayes
and KL-divergence (Kullback-Leibler divergence,
relative entropy). KL-divergence measures how
much one probability distribution differs from
another (Cover and Thomas, 1991). It is defined (for
discrete distributions) by
$$KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \quad (5)$$
By viewing a document as a probability distribu-
tion over words, Naive Bayes can be interpreted in
an information-theoretic framework (Dhillon et al.,
2002). Let $p(w_t|d) = n(w_t, d)/|d|$. Taking logarithms
and dividing by the length of $d$, (1) can be
rewritten as
$$\begin{aligned}
c^*(d) &= \operatorname*{argmax}_j \Bigl[ \log p(c_j) + \sum_{t=1}^{|V|} n(w_t, d) \log p(w_t|c_j) \Bigr] \\
&= \operatorname*{argmax}_j \Bigl[ \frac{1}{|d|} \log p(c_j) + \sum_{t=1}^{|V|} p(w_t|d) \log p(w_t|c_j) \Bigr]
\end{aligned} \quad (6)$$
Adding the entropy of $p(W|d)$, we get

$$\begin{aligned}
c^*(d) &= \operatorname*{argmax}_j \Bigl[ \frac{1}{|d|} \log p(c_j) - \sum_{t=1}^{|V|} p(w_t|d) \log \frac{p(w_t|d)}{p(w_t|c_j)} \Bigr] \\
&= \operatorname*{argmin}_j \Bigl[ KL(p(W|d), p(W|c_j)) - \frac{1}{|d|} \log p(c_j) \Bigr]
\end{aligned} \quad (7)$$
This means that Naive Bayes assigns to a document
d the class which is “most similar” to d in terms
of the distribution of words. Note also that the
prior probabilities are usually dominated by docu-
ment probabilities except for very short documents.
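The equivalence of (1) and (7) can be checked numerically. In this sketch, `nb_class_direct` implements the argmax of (1) and `nb_class_kl` the argmin of (7); terms with $p(w_t|d) = 0$ contribute nothing to the KL sum, so only the document's own words are iterated. Function names and data layout are illustrative assumptions:

```python
import math

def nb_class_direct(doc_counts, priors, cond):
    """argmax_j p(c_j) * prod_t p(w_t|c_j)^n(w_t,d)  (Eq. 1, log space).
    doc_counts maps word index t -> n(w_t, d); cond[j][t] = p(w_t|c_j)."""
    return max(range(len(priors)),
               key=lambda j: math.log(priors[j]) +
               sum(n * math.log(cond[j][t]) for t, n in doc_counts.items()))

def nb_class_kl(doc_counts, priors, cond):
    """argmin_j KL(p(W|d), p(W|c_j)) - (1/|d|) log p(c_j)  (Eq. 7)."""
    length = sum(doc_counts.values())
    def objective(j):
        kl = sum((n / length) * math.log((n / length) / cond[j][t])
                 for t, n in doc_counts.items())
        return kl - math.log(priors[j]) / length
    return min(range(len(priors)), key=objective)
```

Both functions return the same class for any document, since (7) is an order-preserving transformation of (1).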
3 Feature Selection using KL-Divergence
We define a new scoring function for feature selec-
tion based on the following considerations. In the
previous section we have seen that Naive Bayes as-
signs a document d the class c such that the “dis-
tance” between d and c is minimized. A classifi-
cation error occurs when a test document is closer
to some other class than to its true class, in terms of
KL-divergence.
We seek to define a scoring function such that
words whose distribution in the individual training
documents of a class differs greatly from the distribution
in the class (according to (2)) receive a
lower score, while words with a similar distribution
in all training documents of the same class receive
a higher score. By removing words with a lower
score from the vocabulary, the training documents
of each class become more similar to each other,
and therefore, also to the class, in terms of word dis-
tribution. This leads to more homogeneous classes.
Assuming that the test documents and training doc-
uments come from the same distribution, the simi-
larity between the test documents and their respec-
tive classes will be increased as well, thus resulting
in higher classification accuracy.
We now make this more precise. Let $S = \{d_1, \ldots, d_{|S|}\}$
be the set of training documents, and
denote the class of $d_i$ by $c(d_i)$. The average KL-divergence
for a word $w_t$ between the training documents
and their classes is given by

$$KL_t(S) = \frac{1}{|S|} \sum_{d_i \in S} KL(p(w_t|d_i), p(w_t|c(d_i))) \quad (8)$$
One problem with (8) is that in addition to the conditional
probabilities $p(w_t|c_j)$ for each word and
each class, the computation considers each individual
document, thus resulting in a time requirement
of $O(|S|)$.¹ In order to avoid this additional complexity,
instead of $KL_t(S)$ we use an approximation
$\widetilde{KL}_t(S)$, which is based on the following two
assumptions: (i) the number of occurrences of $w_t$
is the same in all documents that contain $w_t$; (ii)
all documents in the same class $c_j$ have the same
length. Let $N_{jt}$ be the number of documents in $c_j$
that contain $w_t$, and let

$$\tilde{p}_d(w_t|c_j) = p(w_t|c_j) \frac{|c_j|}{N_{jt}} \quad (9)$$

be the average probability of $w_t$ in those documents
in $c_j$ that contain $w_t$ (if $w_t$ does not occur in $c_j$, set
$\tilde{p}_d(w_t|c_j) = 0$). Then $KL_t(S)$ reduces to
$$\widetilde{KL}_t(S) = \frac{1}{|S|} \sum_{j=1}^{|C|} N_{jt}\, \tilde{p}_d(w_t|c_j) \log \frac{\tilde{p}_d(w_t|c_j)}{p(w_t|c_j)} \quad (10)$$
Plugging in (9) and (3) and defining $q(w_t|c_j) = N_{jt}/|c_j|$, we get

$$\widetilde{KL}_t(S) = -\sum_{j=1}^{|C|} p(c_j)\, p(w_t|c_j) \log q(w_t|c_j) \quad (11)$$
Note that computing $\widetilde{KL}_t(S)$ only requires statistics
of the number of words and documents for each
class, not per document. Thus $\widetilde{KL}_t(S)$ can be computed
in $O(|C|)$. Typically, $|C|$ is much smaller
than $|S|$.

¹Note that $KL_t(S)$ cannot be computed simultaneously with
$p(w_t|c_j)$ in one pass over the documents in (2): $KL_t(S)$ requires
$p(w_t|c_j)$ when each document is considered, but computing
the latter needs iterating over all documents itself.
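A sketch of how (11) might be computed from per-class statistics alone, with no pass over individual documents; the function name and the argument layout are illustrative assumptions:

```python
import math

def fkl_score(doc_freq, word_count, class_sizes, class_totals, vocab_size):
    """Approximate average KL-divergence ~KL_t(S) per Eq. (11):
    -sum_j p(c_j) p(w_t|c_j) log q(w_t|c_j), q(w_t|c_j) = N_jt / |c_j|.
    doc_freq[j]     = N_jt, documents in c_j containing w_t,
    word_count[j]   = occurrences of w_t in c_j,
    class_sizes[j]  = |c_j| (number of documents),
    class_totals[j] = total word occurrences in c_j."""
    n_docs = sum(class_sizes)
    score = 0.0
    for N_jt, n_jt, size, total in zip(doc_freq, word_count,
                                       class_sizes, class_totals):
        if N_jt == 0:
            continue                 # w_t absent from c_j: term is 0 by Eq. (10)
        p_cj = size / n_docs                          # Eq. (3)
        p_w_cj = (1 + n_jt) / (vocab_size + total)    # Eq. (2)
        score -= p_cj * p_w_cj * math.log(N_jt / size)
    return score
```

A word that appears in every document of a class has $q(w_t|c_j) = 1$, so that class contributes nothing; a word that appears in few documents of the class contributes a large positive term.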
Another important thing to note is the following.
By removing words with an uneven distribution in
the documents of the same class, not only the doc-
uments in the class, but also the classes themselves
may become more similar, which reduces the ability
to distinguish between different classes. Let $p(w_t)$
be the number of occurrences of $w_t$ in all training
documents, divided by the total number of words,
let $q(w_t) = \sum_{j=1}^{|C|} N_{jt}/|S|$, and define

$$\widetilde{K}_t(S) = -p(w_t) \log q(w_t) \quad (12)$$
$\widetilde{K}_t(S)$ can be interpreted as an approximation of the
average divergence of the distribution of $w_t$ in the
individual training documents from the global distribution
(averaged over all training documents in
all classes). If $w_t$ is independent of the class, then
$\widetilde{K}_t(S) = \widetilde{KL}_t(S)$. The difference between the two
is a measure of the increase in homogeneity of the
training documents, in terms of the distribution of
$w_t$, when the documents are clustered in their true
classes. It is large if the distribution of $w_t$ is similar
in the training documents of the same class but dissimilar
in documents of different classes. In analogy
to mutual information, we define our new scoring
function as the difference

$$KL(w_t) = \widetilde{K}_t(S) - \widetilde{KL}_t(S) \quad (13)$$
We also use a variant of KL, denoted dKL, where
$p(w_t)$ is estimated according to (14):

$$p'(w_t) = \sum_{j=1}^{|C|} p(c_j)\, p(w_t|c_j) \quad (14)$$

and $p(w_t|c_j)$ is estimated as in (2).
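Combining (11)-(14), both scores might be computed as follows; the function name and count-based inputs are again illustrative assumptions, and the smoothed estimate (2) is used for $p(w_t|c_j)$:

```python
import math

def kl_scores(doc_freq, word_count, class_sizes, class_totals, vocab_size):
    """Return (KL(w_t), dKL(w_t)) per Eqs. (13) and (14).
    Inputs are per-class counts: doc_freq[j] = N_jt,
    word_count[j] = occurrences of w_t in c_j,
    class_sizes[j] = |c_j|, class_totals[j] = words in c_j."""
    n_docs = sum(class_sizes)
    fkl = 0.0       # ~KL_t(S), Eq. (11)
    p_prime = 0.0   # p'(w_t),  Eq. (14)
    for N_jt, n_jt, size, total in zip(doc_freq, word_count,
                                       class_sizes, class_totals):
        p_cj = size / n_docs
        p_w_cj = (1 + n_jt) / (vocab_size + total)   # Eq. (2)
        p_prime += p_cj * p_w_cj
        if N_jt > 0:
            fkl -= p_cj * p_w_cj * math.log(N_jt / size)
    # ~K_t(S), Eq. (12): p(w_t) is the global relative frequency,
    # q(w_t) the fraction of training documents containing w_t
    p_wt = sum(word_count) / sum(class_totals)
    q_wt = sum(doc_freq) / n_docs
    if q_wt == 0:
        return 0.0, 0.0          # word never occurs: score it zero
    kl = -p_wt * math.log(q_wt) - fkl
    dkl = -p_prime * math.log(q_wt) - fkl
    return kl, dkl
```

As intended by the construction, a word concentrated in the documents of one class scores higher than a word spread evenly over the classes.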
4 Experiments
We compare KL and dKL to mutual information,
using two standard data sets: 20 Newsgroups² and
Reuters-21578.³ In tokenizing the data, only words
consisting of alphabetic characters are used, after
conversion to lower case. In addition, all numbers
are mapped to a special token NUM. For 20 Newsgroups
we remove the newsgroup headers and use a
stoplist consisting of the 100 most frequent words of
²http://www.ai.mit.edu/~jrennie/20Newsgroups/
³http://www.daviddlewis.com/resources/testcollections/reuters21578/
[Figure: classification accuracy (0 to 1) vs. vocabulary size (10 to 100000) for dKL, KL, and MI]
Figure 1: Classification accuracy for 20 News-
groups. The curves have small error bars.
the British National Corpus.⁴ We use the ModApte
split of Reuters-21578 (Apté et al., 1994) and use
only the 10 largest classes. The vocabulary size is
111868 words for 20 Newsgroups and 22430 words
for Reuters.
Experiments with 20 Newsgroups are performed
with 5-fold cross-validation, using 80% of the data
for training and 20% for testing. We build a sin-
gle classifier for the 20 classes and vary the num-
ber of selected words from 20 to 20000. Figure 1
compares classification accuracy for the three scor-
ing functions. dKL slightly outperforms mutual in-
formation, especially for smaller vocabulary sizes.
The difference is statistically significant for 20 to
200 words at the 99% confidence level, and for 20
to 2000 words at the 95% confidence level, using a
one-tailed paired t-test.
For the Reuters dataset we build a binary classi-
fier for each of the ten topics and set the number of
positively classified documents such that precision
equals recall. Precision is the percentage of posi-
tive documents among all positively classified doc-
uments. Recall is the percentage of positive docu-
ments that are classified as positive.
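The breakeven construction above can be sketched as follows: rank the documents by classifier score and label the top $n$ as positive, where $n$ is the number of true positives in the data, so that precision and recall coincide. The interface (scores plus 0/1 labels) is an illustrative assumption:

```python
def breakeven_recall(scores, labels):
    """Set the number of positively classified documents equal to the
    number of positive documents, so precision equals recall, and
    return that common value (the precision/recall breakeven point)."""
    n_pos = sum(labels)
    # take the n_pos highest-scoring documents as positive predictions
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = sum(label for _, label in ranked[:n_pos])
    return tp / n_pos
```

With exactly $n$ predicted positives and $n$ true positives, both precision and recall equal (true positives among the top $n$) / $n$, which is what the function returns.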
In Figures 2 and 3 we report microaveraged and
macroaveraged recall for each number of selected
words. Microaveraged recall is the percentage of all
positive documents (in all topics) that are classified
as positive. Macroaveraged recall is the average of
the recall values of the individual topics. Microav-
eraged recall gives equal weight to the documents
and thus emphasizes larger topics, while macroav-
eraged recall gives equal weight to the topics and
thus emphasizes smaller topics more than microaveraged
recall.

⁴http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html

[Figure: precision/recall breakeven point (0.7 to 1) vs. vocabulary size (10 to 100000) for dKL, KL, and MI]
Figure 2: Microaveraged recall on Reuters at breakeven point.

[Figure: precision/recall breakeven point (0.7 to 1) vs. vocabulary size (10 to 100000) for dKL, KL, and MI]
Figure 3: Macroaveraged recall on Reuters at breakeven point.
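The two averages might be computed as follows from per-topic (true positives, positives) pairs; the interface is an illustrative assumption:

```python
def micro_macro_recall(per_topic):
    """per_topic: list of (true_positives, positives) pairs, one per topic.
    Microaveraged recall pools the documents of all topics;
    macroaveraged recall averages the per-topic recall values."""
    micro = sum(tp for tp, _ in per_topic) / sum(p for _, p in per_topic)
    macro = sum(tp / p for tp, p in per_topic) / len(per_topic)
    return micro, macro
```

The difference matters when topic sizes are skewed: a large topic with high recall dominates the microaverage, while every topic counts equally in the macroaverage.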
Both KL and dKL achieve slightly higher values
for microaveraged recall than mutual information,
for most vocabulary sizes (Fig. 2). KL performs best
at 20000 words with 90.1% microaveraged recall,
compared to 89.3% for mutual information. The
largest improvement is found for dKL at 100 words
with 88.0%, compared to 86.5% for mutual infor-
mation.
For smaller categories, the difference between
the KL-divergence based scores and mutual infor-
mation is larger, as indicated by the curves for
macroaveraged recall (Fig. 3). KL yields the high-
est recall at 20000 words with 82.2%, an increase of
3.9% compared to mutual information with 78.3%,
whereas dKL has its largest value at 100 words with
78.8%, compared to 76.1% for mutual information.
We find the largest improvement at 5000 words with
5.6% for KL and 2.9% for dKL, compared to mutual
information.
5 Conclusion
By interpreting Naive Bayes in an information the-
oretic framework, we derive a new scoring method
for feature selection in text classification, based on
the KL-divergence between training documents and
their classes. Our experiments show that it out-
performs mutual information, which was one of
the best performing methods in previous studies
(Yang and Pedersen, 1997). The KL-divergence
based scores are especially effective for smaller cat-
egories, but additional experiments are certainly re-
quired.
In order to keep the computational cost low,
we use an approximation instead of the exact KL-
divergence. Assessing the error introduced by this
approximation is a topic for future work.
References
Chidanand Apté, Fred Damerau, and Sholom M.
Weiss. 1994. Towards language independent automated
learning of text categorization models.
In Proc. 17th ACM SIGIR Conference on Research
and Development in Information Retrieval
(SIGIR '94), pages 23–30.
Thomas M. Cover and Joy A. Thomas. 1991. El-
ements of Information Theory. John Wiley, New
York.
Inderjit S. Dhillon, Subramanyam Mallela, and Ra-
hul Kumar. 2002. Enhanced word clustering for
hierarchical text classification. In Proc. 8th ACM
SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pages 191–
200.
Pedro Domingos and Michael Pazzani. 1997. On
the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29:103–
130.
Andrew McCallum and Kamal Nigam. 1998. A
comparison of event models for Naive Bayes text
classification. In Learning for Text Categoriza-
tion: Papers from the AAAI Workshop, pages 41–
48. AAAI Press. Technical Report WS-98-05.
Jason D. M. Rennie. 2001. Improving multi-class
text classification with Naive Bayes. Master’s
thesis, Massachusetts Institute of Technology.
Yiming Yang and Jan O. Pedersen. 1997. A com-
parative study on feature selection in text catego-
rization. In Proc. 14th International Conference
on Machine Learning (ICML-97), pages 412–
420.
