Poisson Naive Bayes for Text Classification with Feature Weighting
Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim
Dept. of CSE., Korea University
5-ka Anamdong, SungPuk-ku, SEOUL 136-701, KOREA
{sbkim,hcseo,rim}@nlp.korea.ac.kr
Abstract
In this paper, we investigate the use of a
multivariate Poisson model and feature weighting
to learn a naive Bayes text classifier. Our new
naive Bayes text classification model assumes that
a document is generated by a multivariate Poisson
model, whereas previous work treats a document as
a vector of binary term features based on the
presence or absence of each term. We also explore
feature weighting for naive Bayes text
classification as an alternative to feature
selection, which is quite costly when small
numbers of new training documents are provided
continuously.
Experimental results on two test collections
indicate that our new model, with the proposed
parameter estimation and feature weighting
technique, leads to substantial improvements over
the unigram language model classifiers, which are
known to outperform the original pure naive Bayes
text classifiers.
1 Introduction
The naive Bayes classifier has been one of the core
frameworks in information retrieval research for
many years. Recently, naive Bayes has emerged as a
research topic in itself, because it sometimes
performs well on various tasks compared to more
complex learning algorithms, in spite of its
incorrect independence assumptions.
Similarly, naive Bayes is an attractive approach to
the text classification task because it is simple
enough to be implemented in practice even with a
great number of features. This simplicity enables
us to integrate text classification and filtering
modules into existing information retrieval systems
easily, because the frequency-related information
stored in general text retrieval systems is all the
information that naive Bayes learning requires. No
further complex generalization processes are
required, unlike other machine learning methods
such as SVM or boosting. Moreover, incremental
adaptation using a small number of new training
documents can be performed by just adding or
updating frequencies.
Several earlier works have extensively studied
naive Bayes text classification (Lewis, 1992; Lewis,
1998; McCallum and Nigam, 1998). However, their
pure naive Bayes classifiers treated a document as
a binary feature vector, so they cannot use the
term frequencies in a document, which results in
poor performance. For that reason, the unigram
language model classifier (or multinomial naive
Bayes text classifier) has been referred to as a
promising alternative naive Bayes by a number of
researchers (McCallum and Nigam, 1998; Dumais et
al., 1998; Yang and Liu, 1999; Nigam et al., 2000).
Although the unigram language model classifiers
usually outperform pure naive Bayes, they have also
given disappointing results compared to many other
statistical learning methods such as nearest
neighbor classifiers (Yang and Chute, 1994), support
vector machines (Joachims, 1998), and boosting
(Schapire and Singer, 2000).
In the real world, an operational text
classification system usually works in an
environment where the amount of human-annotated
training documents is small in spite of the
hundreds of thousands of classes. Moreover, the
text classifiers must be re-trained, since small
numbers of new training documents are provided
continuously. In this environment, naive Bayes is
probably a more appropriate model for practical
systems than other, more complex learning models.
Therefore, more intensive studies of the naive
Bayes text classification model are required.
In this paper, we revisit the naive Bayes framework
and propose a Poisson naive Bayes model for text
classification with a statistical feature weighting
method. Feature weighting has many advantages over
previous feature selection approaches, especially
when new training examples are provided
continuously. Our new model assumes that a document
is generated by a multivariate Poisson model, whose
parameters are estimated by weighted averaging of
the normalized and smoothed term frequencies over
all the training documents. Under this assumption,
we have tested the feature weighting approach with
three measures: information gain, the
$\chi^2$-statistic, and a newly introduced
probability ratio. With the proposed model and
feature weighting techniques, we can get much
better performance without losing the simplicity
and efficiency of the naive Bayes model.
The remainder of this paper is organized as
follows. The next section briefly presents a naive
Bayes framework for text classification. Section 3
describes our new naive Bayes model and the
proposed techniques, and the experimental results
are presented in Section 4. Finally, we conclude
the paper and suggest possible directions for
future work in Section 5.
2 Naive Bayes Text Classification
A naive Bayes classifier is a well-known and highly
practical probabilistic classifier that has been
employed in many applications. It assumes that all
attributes of the examples are independent of each
other given the class; this is the independence
assumption. Several studies show that naive Bayes
performs surprisingly well in many domains
(Domingos and Pazzani, 1997) in spite of its
incorrect independence assumption.
In the context of text classification, the
probability of class $c$ given a document $d_j$ is
calculated by Bayes' theorem as follows:

$$
\begin{aligned}
p(c|d_j) &= \frac{p(d_j|c)\,p(c)}{p(d_j)} \\
&= \frac{p(d_j|c)\,p(c)}{p(d_j|c)\,p(c) + p(d_j|\bar{c})\,p(\bar{c})} \\
&= \frac{\frac{p(d_j|c)}{p(d_j|\bar{c})} \cdot p(c)}{\frac{p(d_j|c)}{p(d_j|\bar{c})} \cdot p(c) + p(\bar{c})}
\end{aligned}
\tag{1}
$$

Now, if we define a new function $z_{jc}$,

$$
z_{jc} = \log \frac{p(d_j|c)}{p(d_j|\bar{c})} \tag{2}
$$

then Equation (1) can be rewritten as

$$
p(c|d_j) = \frac{e^{z_{jc}} \cdot p(c)}{e^{z_{jc}} \cdot p(c) + p(\bar{c})} \tag{3}
$$
Using Equation (3), we can obtain the posterior
probability $p(c|d_j)$ by computing $z_{jc}$, which
is a log ratio similar to that of the BIM retrieval
model (Jones et al., 2000). This means that the
linked independence assumption (Cooper et al.,
1992), which explains why the strong independence
assumption can be relaxed in the BIM model, is
sufficient for the naive Bayes text classification
model.
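As an illustration of Equation (3), the following minimal sketch converts a log ratio $z_{jc}$ and a class prior $p(c)$ into a posterior. The function name `posterior_from_log_ratio` is hypothetical, and the rewriting as a logistic function of $z_{jc}$ plus the prior log-odds is an implementation choice made here to avoid overflow in $e^{z_{jc}}$, not something the paper prescribes. It assumes $0 < p(c) < 1$ and $p(\bar{c}) = 1 - p(c)$.

```python
import math


def posterior_from_log_ratio(z_jc: float, prior_c: float) -> float:
    """Posterior p(c|d_j) of Equation (3), given z_jc and the prior p(c).

    Dividing numerator and denominator of Equation (3) by p(c_bar) gives
    odds = exp(z_jc) * p(c) / p(c_bar), so the posterior is a logistic
    function of z_jc + log p(c) - log p(c_bar).
    """
    log_odds = z_jc + math.log(prior_c) - math.log(1.0 - prior_c)
    if log_odds >= 0:
        # exp of a non-positive number cannot overflow
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)
```

When $z_{jc} = 0$ the document is equally likely under both classes, and the posterior reduces to the prior $p(c)$.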
Within this framework, two representative naive
Bayes text classification approaches are described
in (McCallum and Nigam, 1998). They designated the
pure naive Bayes as the multivariate Bernoulli
model, and the unigram language model classifier as
the multinomial model. Instead, in the next section
we introduce a multivariate Poisson model to
improve on pure naive Bayes text classification.
3 Poisson Naive Bayes Text Classification
3.1 Overview
The Poisson distribution is most commonly used to
model the number of random occurrences of some
phenomenon in a specified unit of space or time,
for example, the number of phone calls received by
a telephone operator in a 10-minute period. If we
regard each occurrence of a term as a random
occurrence in a fixed unit of space (i.e., the
length of a document), the Poisson distribution is
intuitively suitable for modeling the term
frequencies in a given document. For that reason,
the Poisson model has been widely investigated in
the IR literature, but it is rarely used for the
text classification task (Lewis, 1998). This
motivates us to adopt the Poisson model for
learning naive Bayes text classification.
Our model assumes that $d_j$ is generated by a
multivariate Poisson model. In other words, a
document $d_j$ is a random vector consisting of the
Poisson random variables $X_i$, and $X_i$ takes as
its value the within-document frequency $f_{ij}$ of
the $i$-th term $t_i$. Thus, if we assume
independence among the terms in $d_j$, the
probability of $d_j$ is given by

$$
p(d_j) = \prod_{i=1}^{|V|} P(X_i = f_{ij}) \tag{4}
$$

where $|V|$ is the vocabulary size, and each
$P(X_i = f_{ij})$ is given by

$$
P(X_i = f_{ij}) = \frac{e^{-\lambda} \lambda^{f_{ij}}}{f_{ij}!} \tag{5}
$$

where $\lambda$ is the Poisson mean.
As a result, the $z_{jc}$ function of Equation (2)
is rewritten using Equations (4) and (5) as follows:

$$
\begin{aligned}
z_{jc} &= \sum_{i=1}^{|V|} \log \frac{P(X_i = f_{ij}|c)}{P(X_i = f_{ij}|\bar{c})} \\
&= \sum_{i=1}^{|V|} \log \frac{e^{-\lambda_i} \lambda_i^{f_{ij}}}{e^{-\mu_i} \mu_i^{f_{ij}}}
\end{aligned}
\tag{6}
$$

where $\lambda_i$ and $\mu_i$ are the Poisson means
for $t_i$ in class $c$ and class $\bar{c}$,
respectively.
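The step from the first to the second line of Equation (6) relies on the factorial $f_{ij}!$ cancelling between the two classes. The sketch below checks this numerically; the function names and the toy numbers are illustrative assumptions, and integer frequencies are used so that the Poisson probability mass function is well defined.

```python
import math


def poisson_log_pmf(k: int, mean: float) -> float:
    """log of e^{-mean} * mean^k / k!, as in Equation (5)."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)


def log_ratio(freqs, lam, mu):
    """First line of Equation (6): sum of log P(X_i=f|c) - log P(X_i=f|c_bar)."""
    return sum(
        poisson_log_pmf(f, l) - poisson_log_pmf(f, m)
        for f, l, m in zip(freqs, lam, mu)
    )


def log_ratio_simplified(freqs, lam, mu):
    """Second line of Equation (6): the f! terms cancel, leaving
    (mu_i - lambda_i) + f_ij * log(lambda_i / mu_i) for each term."""
    return sum((m - l) + f * math.log(l / m) for f, l, m in zip(freqs, lam, mu))
```

For example, with frequencies `[2, 0, 1]`, class means `[1.5, 0.2, 0.4]` and complement means `[0.5, 0.3, 0.4]`, both functions return the same value up to floating-point error.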
The most important issues of this work are as
follows:
• How should we decide the frequency $f_{ij}$
representing the characteristic of document $d_j$?
• How should we estimate the model parameters
$\lambda_i$ and $\mu_i$ representing the
characteristic of each class?
We propose possible answers in the next subsection.
3.2 Parameter Estimation
Since, by the definition of the Poisson
distribution, $f_{ij}$ is the frequency of term
$t_i$ in a document $d_j$ of fixed length, we must
normalize the actual term frequencies in documents
of different lengths. In addition, many earlier
works in the NLP and IR fields recommend smoothing
term frequencies in order to build a more accurate
model.
Thus, we estimate $f_{ij}$ as the normalized and
smoothed version of the actual term frequency
$x_{ij}$, given by

$$
\tilde{f}_{ij} = \frac{x_{ij} + \theta}{dl_j + \theta \cdot |V|} \cdot \tau \tag{7}
$$

where $\theta$ is a Laplace smoothing parameter,
$\tau$ is any huge value that makes all the
$\tilde{f}_{ij}$ in our model integers¹, and $dl_j$
is the length of $d_j$.
Conceptually, $\tilde{f}_{ij}$ can be regarded as
the value estimated by the following steps: 1) add
$\theta$ occurrences of each of the $|V|$ terms to
the document $d_j$; 2) scale $d_j$ up to $d'_j$,
whose total length is $\tau$, without changing the
proportion of the frequency of each term $t_i$;
3) count $t_i$ in $d'_j$.
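The three steps above can be sketched as follows. This is a minimal illustration of Equation (7), assuming a document is given as a term-to-count mapping and the vocabulary as a list; the function name `normalized_frequencies` and the default values of `theta` and `tau` are assumptions, not values taken from the paper.

```python
def normalized_frequencies(counts, vocabulary, theta=1.0, tau=1.0):
    """Normalized and smoothed frequencies of Equation (7) for one document.

    counts     -- mapping term -> raw frequency x_ij in the document
    vocabulary -- list of all |V| terms
    theta      -- Laplace smoothing parameter
    tau        -- scaling constant (the target length of the scaled document)
    """
    doc_length = sum(counts.get(t, 0) for t in vocabulary)  # dl_j
    denom = doc_length + theta * len(vocabulary)            # dl_j + theta * |V|
    return {t: (counts.get(t, 0) + theta) / denom * tau for t in vocabulary}
```

Whatever the original document length, the returned values sum to `tau`, which is what makes frequencies from documents of different lengths comparable.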
Then the Poisson mean $\lambda_i$, which represents
the average number of occurrences of $t_i$ in the
documents belonging to class $c$, is estimated from
the normalized and smoothed $\tilde{f}_{ij}$ values
over the training documents as follows:

$$
\tilde{\lambda}_i = \sum_{d_j \in D_c} g(d_j|c) \cdot \tilde{f}_{ij} \tag{8}
$$

where $D_c$ is the set of training documents
belonging to class $c$, and $g(d_j|c)$² is an
interpolation of the uniform probability and the
probability proportional to the length of the
document, calculated as follows:

$$
g(d_j|c) = \alpha \cdot \frac{1}{|D_c|} + (1 - \alpha) \cdot \frac{dl_j}{\sum_{d_j \in D_c} dl_j} \tag{9}
$$
Simple averaging of $\tilde{f}_{ij}$, the case of
$\alpha = 1$, seems to be the correct way to
estimate $\lambda_i$. However, the statistics from
long documents can be more reliable than those from
short documents, so we interpolate between the two
different probabilities with the parameter $\alpha$
ranging from 0 to 1. Consequently, $\lambda_i$ is a
weighted average over all training documents
belonging to class $c$, and $\mu_i$ for class
$\bar{c}$ can be estimated in the same manner.

¹ Since $f_{ij}$ is the value of the random variable
$X_i$ representing the frequency in our Poisson
distribution, we multiply the normalized frequency
by an artificial constant $\tau$ to make $f_{ij}$
an integer. However, $\tau$ is dropped in the final
derived function.
² We use the notation $g(d_j|c)$ for the
distribution defined only over the training
documents, to distinguish it from the notation
$p(d_j|c)$ used in Section 2.
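A minimal sketch of Equations (8) and (9) for a single term and a single class is given below. The function names are hypothetical; the inputs are assumed to be parallel lists holding, for each training document of the class, its length $dl_j$ and its normalized frequency $\tilde{f}_{ij}$ for the term in question.

```python
def document_weights(doc_lengths, alpha):
    """g(d_j|c) of Equation (9) for every training document of one class."""
    n_docs = len(doc_lengths)          # |D_c|
    total_length = sum(doc_lengths)    # sum of dl_j over D_c
    return [
        alpha / n_docs + (1.0 - alpha) * dl / total_length
        for dl in doc_lengths
    ]


def poisson_mean(norm_freqs, doc_lengths, alpha):
    """Weighted average of Equation (8) for one term."""
    weights = document_weights(doc_lengths, alpha)
    return sum(g * f for g, f in zip(weights, norm_freqs))
```

With `alpha=1.0` this is the simple average over documents, and with `alpha=0.0` each document contributes in proportion to its length. For two documents of lengths 10 and 30, `alpha=0.5` gives the weights 0.375 and 0.625.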
3.3 Feature Weighting
Feature selection is often performed as a
preprocessing step, both to reduce the feature
space and to improve classification performance.
Text classifiers are then trained with various
machine learning algorithms in the resulting
feature space. Yang and Pedersen (1997)
investigated several measures for selecting useful
term features, including mutual information (MI),
information gain (IG), and the $\chi^2$-statistic
(CHI). In contrast, Joachims (1998) claimed that
there are no useless term features, and that it is
preferable to use all term features. Clearly,
learning and classification become very efficient
when the feature space is considerably reduced.
However, there is no definite conclusion about how
much feature selection contributes to the overall
performance of text classification systems; it may
depend considerably on the learning algorithm
employed. We believe that proper external feature
selection or weighting is required to improve the
performance of naive Bayes, since naive Bayes has
no discriminative optimization process of its own.
Of the two possible approaches, feature selection
is very inefficient when additional training
documents are provided continuously, because the
feature set must be redefined according to the
modified term statistics of the new training
document set, and the classifiers must be trained
again with this new feature set. For that reason,
we prefer feature weighting to feature selection
for improving naive Bayes.
With the feature weighting method, our $z_{jc}$ is
redefined as follows:

$$
z_{jc} = \sum_{i=1}^{|V|} \frac{w_{ic}}{W_c} \cdot \log \frac{e^{-\tilde{\lambda}_i} \tilde{\lambda}_i^{\tilde{f}_{ij}}}{e^{-\tilde{\mu}_i} \tilde{\mu}_i^{\tilde{f}_{ij}}} \tag{10}
$$

where $w_{ic}$ is the weight of feature $t_i$ for
class $c$, and $W_c$ is the normalization factor,
that is, $\sum_{i=1}^{|V|} w_{ic}$.
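Equation (10) can be read as a weighted sum of per-term Poisson log ratios. The sketch below assumes parallel lists over the vocabulary for the normalized frequencies, the two sets of Poisson means, and the feature weights; the function name `weighted_log_ratio` is hypothetical.

```python
import math


def weighted_log_ratio(norm_freqs, lam, mu, weights):
    """Feature-weighted z_jc of Equation (10).

    norm_freqs -- normalized frequencies of the document, one per term
    lam, mu    -- Poisson means for class c and its complement
    weights    -- feature weights w_ic, one per term
    """
    total_weight = sum(weights)  # W_c
    z = 0.0
    for f, l, m, w in zip(norm_freqs, lam, mu, weights):
        # log of (e^{-l} * l^f) / (e^{-m} * m^f)
        term = (m - l) + f * math.log(l / m)
        z += (w / total_weight) * term
    return z
```

With equal weights the result is the unweighted log ratio divided by $|V|$; a term with weight zero has no influence on the score.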
Table 1: Two-way contingency table
                     presence of $t_i$   absence of $t_i$
labeled as $c$              a                   b
not labeled as $c$          c                   d

In our work, three measures are used to weight each
term feature: information gain, the
$\chi^2$-statistic, and probability ratio.
Information gain (or average mutual information) is
an information-theoretic measure defined as the
amount by which uncertainty is reduced given a
piece of information. We use the information gain
value as the weight of each term for class $c$, and
the value is calculated using a document event
model as follows:

$$
\begin{aligned}
w_{ic} &= H(C) - H(C|W_i) \\
&= \sum_{c_s \in \{c, \bar{c}\}} \sum_{w_t \in \{w_i, \bar{w}_i\}} p(c_s, w_t) \log \frac{p(c_s, w_t)}{p(c_s)\,p(w_t)}
\end{aligned}
\tag{11}
$$

where, for example, $p(c)$ is the number of
documents belonging to class $c$ divided by the
total number of documents, and $p(\bar{w})$ is the
number of documents without the term $w$ divided by
the total number of documents.
The second measure we used is the
$\chi^2$-statistic, developed for statistical
hypothesis testing. In text classification, given a
two-way contingency table for each term $t_i$ and
class $c$ as shown in Table 1, $w_{ic}$ is
calculated as follows:

$$
w_{ic} = \frac{(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)} \tag{12}
$$

where $a$, $b$, $c$ and $d$ indicate the number of
documents in each cell of Table 1.
Yang and Pedersen (1997) compared various feature
selection methods, and concluded that these two
measures are the most effective for their kNN and
LLSF classification models.
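Both Equation (11) and Equation (12) can be computed from the four document counts of Table 1. The sketch below is one way to do this; the function names are hypothetical, empty cells are skipped in the information gain sum (treating $0 \log 0$ as 0), and a zero denominator in Equation (12) is mapped to a weight of 0, which is a convention assumed here rather than one stated in the paper. Equation (12) is implemented exactly as written, without the factor $N$ that some formulations of the $\chi^2$-statistic include.

```python
import math


def information_gain(a, b, c, d):
    """Equation (11) from the document counts of Table 1.

    a: docs in class with the term      b: docs in class without the term
    c: docs outside class with the term d: docs outside class without it
    """
    n = a + b + c + d
    # (joint count, class marginal count, term marginal count) for each cell
    cells = [
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    ]
    ig = 0.0
    for joint, row, col in cells:
        if joint > 0:  # an empty cell contributes nothing
            ig += (joint / n) * math.log(joint * n / (row * col))
    return ig


def chi_square_weight(a, b, c, d):
    """Equation (12) as written in the text."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return (a * d - b * c) ** 2 / denom if denom else 0.0
```

A term that is distributed identically inside and outside the class (for example `a = b = c = d`) receives a weight of 0 under both measures.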
Finally, we introduce a new measure, the
probability ratio, defined by

$$
w_{ic} = \frac{p(w_i|c)}{p(w_i|\bar{c})} + \frac{p(w_i|\bar{c})}{p(w_i|c)} \tag{13}
$$

This measure is the sum of the ratio of the two
class-conditional probabilities and its reciprocal.
The first and second terms represent the degree to
which the term predicts the positive and the
negative class, respectively. The weight is always
positive and never less than 2; it equals 2 only
when the two class-conditional probabilities are
equal.
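Equation (13) is short enough to state directly in code. The sketch below assumes the caller supplies two strictly positive class-conditional probabilities (for example, smoothed estimates); how these probabilities are estimated is not specified in the text, so that part is left out. The function name `probability_ratio` is hypothetical.

```python
def probability_ratio(p_w_given_c, p_w_given_not_c):
    """Equation (13): ratio of class-conditional probabilities plus its reciprocal.

    Both arguments must be strictly positive.
    """
    ratio = p_w_given_c / p_w_given_not_c
    return ratio + 1.0 / ratio
```

The measure is symmetric in its two arguments, so a term that is four times more likely in the positive class gets the same weight (4.25) as a term that is four times more likely in the negative class.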
We have conducted the experiments with these
three measures for the feature weighting test, and
the results are given in Section 4.
3.4 Implementation Issues
By a couple of simple arithmetic operations, our
final $z_{jc}$ function can be rewritten as follows:

$$
z_{jc} = \frac{\tau}{W_c} \left( A_c + (B_c + \hat{z}_{jc}) \cdot \frac{1}{dl'_j} \right) \tag{14}
$$

where

$$
\begin{aligned}
A_c &= \sum_{i=1}^{|V|} w_{ic} \left( \tilde{\mu}'_i - \tilde{\lambda}'_i \right) \\
B_c &= \theta \sum_{i=1}^{|V|} w_{ic} \log \frac{\tilde{\lambda}'_i}{\tilde{\mu}'_i} \\
\hat{z}_{jc} &= \sum_{\forall i,\, t_i \in d_j} w_{ic}\, x_{ij} \log \frac{\tilde{\lambda}'_i}{\tilde{\mu}'_i} \\
\tilde{\lambda}'_i &= \frac{1}{\tau} \tilde{\lambda}_i = \sum_{d_j \in D_c} g(d_j|c) \cdot \frac{\tilde{f}_{ij}}{\tau} \\
dl'_j &= dl_j + \theta |V|
\end{aligned}
$$

In this equation, $\tilde{\lambda}'_i$ and
$\tilde{\mu}'_i$ are just weighted averages of the
$\tau$-dropped $\tilde{f}_{ij}$, that is,
$\frac{x_{ij} + \theta}{dl_j + \theta |V|}$. $W_c$,
$A_c$ and $B_c$ are class-specific constants, and
$\tau$ is a constant over all the classes and
documents. If the class $c$ is fixed, $W_c$, $A_c$
and $\tau$ can be dropped, and the ranking function
$z^{*}_{jc}$ is defined as follows:

$$
z^{*}_{jc} = (B_c + \hat{z}_{jc}) \cdot \frac{1}{dl'_j} \tag{15}
$$
When we use this ranking function $z^{*}_{jc}$, the
exact posterior probability $p(c|d_j)$ presented in
Section 2 can no longer be calculated. However,
this is a minor issue, since most IR systems have
no interest in the exact posterior probability. In
addition, all the parameters in our model can be
calculated incrementally.
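One possible way to realize the incremental calculation is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the class `ClassStats` and the function `ranking_score` are hypothetical names, the vocabulary size is assumed to be fixed in advance, documents are assumed to be non-empty term-to-count mappings over the vocabulary, and `weights` is assumed to hold $w_{ic}$ for every vocabulary term. The sketch keeps four running sums per class, from which the $\tau$-dropped mean of Equation (8) can be rebuilt for any term after each new document.

```python
import math
from collections import defaultdict


class ClassStats:
    """Running sums for one class (or its complement)."""

    def __init__(self, vocab_size, theta=1.0, alpha=0.8):
        self.vocab_size, self.theta, self.alpha = vocab_size, theta, alpha
        self.n_docs = 0          # |D_c|
        self.total_length = 0    # sum of dl_j
        self.base_uniform = 0.0  # sum_j theta / dl'_j
        self.base_length = 0.0   # sum_j dl_j * theta / dl'_j
        self.term_uniform = defaultdict(float)  # sum_j x_ij / dl'_j
        self.term_length = defaultdict(float)   # sum_j dl_j * x_ij / dl'_j

    def add(self, counts):
        """Update the sums with one new training document."""
        dl = sum(counts.values())
        dl_prime = dl + self.theta * self.vocab_size
        self.n_docs += 1
        self.total_length += dl
        self.base_uniform += self.theta / dl_prime
        self.base_length += dl * self.theta / dl_prime
        for term, x in counts.items():
            self.term_uniform[term] += x / dl_prime
            self.term_length[term] += dl * x / dl_prime

    def mean(self, term):
        """tau-dropped Poisson mean of Equation (8) with g from Equation (9)."""
        uniform = (self.base_uniform + self.term_uniform.get(term, 0.0)) / self.n_docs
        by_length = (self.base_length + self.term_length.get(term, 0.0)) / self.total_length
        return self.alpha * uniform + (1.0 - self.alpha) * by_length


def ranking_score(counts, pos, neg, weights):
    """Ranking function z*_jc of Equation (15) for one document."""
    dl_prime = sum(counts.values()) + pos.theta * pos.vocab_size
    b_c = pos.theta * sum(
        w * math.log(pos.mean(t) / neg.mean(t)) for t, w in weights.items()
    )
    z_hat = sum(
        weights[t] * x * math.log(pos.mean(t) / neg.mean(t))
        for t, x in counts.items()
    )
    return (b_c + z_hat) / dl_prime
```

In practice $B_c$ would be cached per class and refreshed only when the statistics change; it is recomputed on every call here to keep the sketch short.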
[Figure: MicroF1 (y-axis) versus alpha (x-axis) on Reuters21578, for PNB and the unigram model]
Figure 1: MicroF1 performance on Reuters21578
according to the interpolation parameter $\alpha$
for estimating $\lambda$ and $\mu$ (without feature
weighting)
4 Experimental Results
4.1 Data and Evaluation Measure
Our experiments were performed on two datasets: the
Reuters21578 and KoreanNews2002 collections. The
Reuters21578 collection is the most widely used
benchmark dataset in text categorization research.
We have used the “ModApte” split version, which
consists of 9603 training documents and 3299 test
documents. There are 90 categories, and each
document has one or more of the categories.
We have built another benchmark collection, the
KoreanNews2002 collection. It is composed of 15,000
news articles published during 2002. The articles
were collected from a number of Korean news portal
websites, and each article is labeled with exactly
one of 46 classes. All the documents have date
stamps attached and have been ordered by date
stamp. Following this date order, we divided them
into the first 10,000 documents for training and
the last 5,000 documents for testing.
Performance is evaluated using the popular F1
measure, and the F1 values for each class are
micro-averaged (MicroF1) and macro-averaged
(MacroF1) to examine the general classification
performance.
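For readers unfamiliar with the two averages, the sketch below computes both from per-class counts of true positives, false positives and false negatives. It follows the standard definitions of micro- and macro-averaged F1; the function names and the example counts are illustrative and are not taken from the experiments.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(per_class):
    """per_class: list of (tp, fp, fn) tuples, one per class.

    MacroF1 averages the per-class F1 values, so every class counts equally.
    MicroF1 pools the counts first, so frequent classes dominate.
    """
    macro = sum(f1(tp, fp, fn) for tp, fp, fn in per_class) / len(per_class)
    tp_sum = sum(c[0] for c in per_class)
    fp_sum = sum(c[1] for c in per_class)
    fn_sum = sum(c[2] for c in per_class)
    micro = f1(tp_sum, fp_sum, fn_sum)
    return micro, macro
```

For a large class with counts `(90, 10, 10)` and a rare class with counts `(1, 4, 4)`, MacroF1 is 0.55 while MicroF1 is about 0.87, which is why MacroF1 is the more sensitive indicator of performance on rare categories.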
4.2 Proposed Model : PNB (vs. UM)
Figure 1 shows the performance of our new model,
the Poisson naive Bayes (PNB) classifier, according
to the interpolation parameter $\alpha$ for
estimating the Poisson means $\lambda$ and $\mu$.
The baseline method is a unigram model classifier
(UM), which is also referred to as the multinomial
naive Bayes classifier described in (McCallum and
Nigam, 1998). Our proposed PNB clearly outperforms
the UM.

Table 2: Performances of UM and PNB on the Reuters21578 collection
          UM      PNB(min)  PNB(max)
MicroF1   0.7212  0.7644    0.7706
MacroF1   0.3214  0.4227    0.4358

Table 3: Performances of UM and PNB on the KoreanNews2002 collection
          UM      PNB(min)  PNB(max)
MicroF1   0.6502  0.7031    0.7094
MacroF1   0.5208  0.5859    0.5949
Although the MicroF1 values do not differ
significantly across the various $\alpha$ values,
the F1 value of each class is considerably affected
by $\alpha$. Figure 2 presents the fluctuations of
the F1 values for 4 classes in the Reuters21578
collection. From this result, we can infer that
there is no globally optimal value of $\alpha$;
rather, each class has its own optimal $\alpha$. In
our experiments, many of the classes have their
highest F1 value when $\alpha$ is about 0.8 or 0.9,
except for some classes such as the corn class,
which shows its highest F1 value at $\alpha = 0.3$.
Similar results are obtained on the KoreanNews2002
collection.
Tables 2 and 3 show the MicroF1 and MacroF1 values
of the unigram model classifiers and our PNB on the
two collections, where PNB(min) and PNB(max) are
the lowest and highest values over the different
$\alpha$. In all cases, PNB is superior to UM.
4.3 Feature Weighting : PNB-{IG,CHI,PrR}
We have fixed the interpolation parameter $\alpha$
at 0.8, and evaluated the following feature
weighting methods: PNB-IG with information gain,
PNB-CHI with the $\chi^2$-statistic, and PNB-PrR
with probability ratio. These experiments reveal
some important behaviors of feature-weighted PNB
classifiers. To explain them, we have grouped the
classes into bins according to the number of
training documents for each class. Five bins are
generated for each of the Reuters21578 and
KoreanNews2002 collections.
[Figure: four panels (acq, grain, interest, corn), each plotting F1 against alpha for PNB and the unigram model]
Figure 2: Performances for 4 categories in
Reuters21578 according to the interpolation
parameter $\alpha$ for estimating $\lambda$ and
$\mu$ (without feature weighting)
[Figure: two panels (Reuters21578, KoreanNews), each plotting MacroF1 for each bin against the average number of training documents for each bin, for PNB, PNB-IG, PNB-CHI and PNB-PrR]
Figure 3: MacroF1 performances of the bins on
Reuters21578 and KoreanNews2002
The average F1 performance of each bin is shown in
Figure 3. The clearest observation is that feature
weighting is highly effective in the bins of
classes with a small number of training documents,
but contributes little in the bins of classes with
sufficiently many training documents. In the bins
with enough training documents, simple PNB
classifiers perform similarly to PNB with feature
weighting. This tendency is more clearly visible in
the Reuters21578 collection, where a third of the
classes have fewer than 10 training documents. In
contrast, two thirds of the classes in the
KoreanNews2002 collection have more than a hundred
training documents.
Among the feature weighting methods, PNB-PrR
performs more stably than PNB-IG and PNB-CHI.
PNB-IG and PNB-CHI somewhat degrade the performance
in the classes with a large number of training
documents, while PNB-PrR maintains good performance
in those classes on both collections. On the other
hand, PNB-IG and PNB-CHI considerably improve the
performance in the rare categories, though the
improvement differs somewhat between the two
collections. For example, PNB-CHI significantly
improves on simple PNB on the Reuters21578
collection, while PNB-IG is very effective on the
KoreanNews2002 collection. Thus, the proper feature
weighting method depends on the characteristics of
the collection, and different feature weighting
strategies should be adopted to improve naive Bayes
text classification.
Based on these observations, we tested another
classifier, PNB*, which employs a different feature
weighting method for each bin to obtain
near-optimal performance. Tables 4 and 5 summarize
the performances, including PNB*, on both
collections. Our proposed model with feature
weighting is very effective compared to the
baseline UM method. Moreover, the performance of
the bin-optimized PNB* on the Reuters21578
collection shows that Poisson naive Bayes with
feature weighting can achieve the state-of-the-art
performance obtained by SVM or kNN, as reported in
(Yang and Liu, 1999; Joachims, 1998).
5 Conclusion and Future Work
In this paper, we propose a Poisson naive Bayes
text classification model with feature weighting.
Our new model uses the normalized and smoothed term
frequencies of each document, and the Poisson
parameters are calculated by weighted averaging of
these frequencies over all training documents.
Experimental results show that the proposed model
is quite useful for building probabilistic text
classification systems at no extra cost compared to
the traditional simple naive Bayes or unigram
language model classifiers.
Further improvement is achieved by a feature
weighting technique. In our experiments, three
measures, namely the chi-square statistic,
information gain, and the newly introduced
probability ratio, are adopted to weight each term
feature. The results show that feature weighting
considerably improves the performance for the
classes with a small number of training documents,
but not for the classes with sufficient training
documents. Probability ratio also performs well,
especially in the classes with a great number of
training documents, where the other feature
weighting methods perform unsatisfactorily.

Table 4: Summary of the performances on the Reuters21578 collection
          UM      PNB     PNB-IG            PNB-CHI           PNB-PrR           PNB*
MicroF1   0.7212  0.7690  0.7971            0.8167            0.8190 (+13.56%)  0.8341
MacroF1   0.3414  0.4307  0.5800            0.6601 (+93.35%)  0.5899            0.6645

Table 5: Summary of the performances on the KoreanNews2002 collection
          UM      PNB     PNB-IG            PNB-CHI           PNB-PrR           PNB*
MicroF1   0.6502  0.7056  0.7114            0.7122            0.7409 (+13.95%)  0.7438
MacroF1   0.5208  0.5906  0.6305 (+21.06%)  0.5748            0.6119            0.6662
In future work, we will try to develop automatic
methods for selecting proper feature weighting
measures and determining the interpolation
parameters for the different classes. Furthermore,
we will explore applications of our approach to
other tasks such as adaptive filtering and
relevance feedback.

References
William S. Cooper, Fredric C. Gey, and Daniel P. Dabney.
1992. Probabilistic retrieval based on staged logistic
regression. Proceedings of SIGIR-92, 15th ACM
International Conference on Research and Development
in Information Retrieval, pages 198–210.
Pedro Domingos and Michael J. Pazzani. 1997. On the
optimality of the simple Bayesian classifier under
zero-one loss. Machine Learning, 29(2/3):103–130.
Susan Dumais, John Platt, David Heckerman, and Mehran
Sahami. 1998. Inductive learning algorithms and
representations for text categorization. Proceedings of
CIKM-98, 7th ACM International Conference on
Information and Knowledge Management, pages 148–155.
Thorsten Joachims. 1998. Text categorization with
support vector machines: learning with many relevant
features. Proceedings of ECML-98, 10th European
Conference on Machine Learning, pages 137–142.
Karen Sparck Jones, Steve Walker, and Stephen E.
Robertson. 2000. A probabilistic model of information
retrieval: development and comparative experiments -
part 1. Information Processing and Management,
36(6):779–808.
David D. Lewis. 1992. Representation and learning
in information retrieval. Ph.D. thesis, Department
of Computer Science, University of Massachusetts,
Amherst, US.
David D. Lewis. 1998. Naive (Bayes) at forty: The
independence assumption in information retrieval.
Proceedings of ECML-98, 10th European Conference on
Machine Learning, pages 4–15.
Andrew K. McCallum and Kamal Nigam. 1998. Employing
EM in pool-based active learning for text
classification. Proceedings of ICML-98, 15th
International Conference on Machine Learning, pages
350–358.
Kamal Nigam, Andrew K. McCallum, Sebastian Thrun,
and Tom M. Mitchell. 2000. Text classification from
labeled and unlabeled documents using EM. Machine
Learning, 39(2/3):103–134.
Robert E. Schapire and Yoram Singer. 2000. BoosTexter:
a boosting-based system for text categorization.
Machine Learning, 39(2/3):135–168.
Yiming Yang and Christopher G. Chute. 1994. An
example-based mapping method for text categorization
and retrieval. ACM Transactions on Information
Systems, 12(3):252–277.
Yiming Yang and Xin Liu. 1999. A re-examination of
text categorization methods. Proceedings of SIGIR-99,
22nd ACM International Conference on Research
and Development in Information Retrieval, pages 42–49.
Yiming Yang and Jan O. Pedersen. 1997. A comparative
study on feature selection in text categorization.
Proceedings of ICML-97, 14th International Conference
on Machine Learning, pages 412–420.
