﻿A Mixture-of-Experts Framework for Text Classification
Andrew Estabrooks Nathalie Japkowicz
IBM Toronto Lab, Office 1B28B, SITE, University of Ottawa,
1150 Eglinton Avenue East, 150 Louis Pasteur, P.O. Box 450 Stn. A,
North York, Ontario, Canada, M3C 1H7 Ottawa, Ontario, Canada K1N 6N5
aestabro@ca.ibm.com nat@site.uottawa.ca
Abstract
One of the particular characteristics
of text classification tasks is that they
present large class imbalances. Such a
problem can easily be tackled using re-
sampling methods. However, although
these approaches are very simple to im-
plement, tuning them most effectively
is not an easy task. In particular,
it is unclear whether oversampling is
more effective than undersampling and
which oversampling or undersampling
rate should be used. This paper presents
a method for combining different ex-
pressions of the re-sampling approach
in a mixture of experts framework. The
proposed combination scheme is evalu-
ated on a very imbalanced subset of the
REUTERS-21578 text collection and is
shown to be very effective on this do-
main.
1 Introduction
A typical use of Machine Learning methods in
the context of Natural Language Processing is in
the domain of text classification. Unfortunately,
several characteristics specific to text data make
its classification a difficult problem to handle. In
particular, the data is typically highly dimensional
and it presents a large class imbalance, i.e., there,
typically, are very few documents on the topic of
interest while texts on unrelated subjects abound.
Furthermore, although large amounts of texts are
available on line, little of them are labeled. Be-
cause the class imbalance problem is known to
negatively affect typical classifiers and because
unlabeled data have no place in conventional su-
pervised learning, using off-the-shelf supervised
classifiers is likely not to be very successful in
the context of text data. It is, instead, recom-
mended to devise a classification method specifi-
cally tuned to the text classification problem.
The purpose of this study is to target some of
the characteristics of text data in the hope of im-
proving the effectiveness of the classification pro-
cess. The topics of finding a good representa-
tion for text data and dealing with its high di-
mensionality have been investigated previously
with, for example, the use of Wordnet [e.g., (Scott
& Matwin, 1999)] and Support Vector Machines
[e.g., (Joachims, 1998)], respectively. We will not
be addressing these problems here. The question
that we will tackle in this paper, instead, is that
of dealing with the class imbalance, and, in the
process of doing so, that of finding a way to take
advantage of the extra, albeit, unlabeled data that
are often left unused in classification studies.1
Several approaches have previously been pro-
posed to deal with the class imbalance problem
including a simple and yet quite effective method:
re-sampling [e.g., (Lewis & Gale, 1994), (Kubat
& Matwin, 1997), (Domingos, 1999)]. This paper
deals with the two different types of re-sampling
approaches: methods that oversample the small
class in order to make it reach a size close to that
of the larger class and methods that undersample
the large class in order to make it reach a size
close to that of the smaller class. Because it is
unclear whether oversampling is more effective
than undersampling and which oversampling or
undersampling rate should be used, we propose a
1Note, however, that unlabeled data is not always left un-
used as in the work on co-learning of (Blum & Mitchell,
1998). As discussed below, however, our approach will
make use of the unlabeled data in a different way.
method for combining a number of classifiers that
oversample and undersample the data at differ-
ent rates in a mixture of experts framework. The
mixture-of-experts is constructed in the context of
a decision tree induction system: C5.0, and all re-
sampling is done randomly. This proposed com-
bination scheme is, subsequently, evaluated on a
a subset of the REUTERS-21578 text collection
and is shown to be very effective in this case.
The remainder of this paper is divided into
four sections. Section 2 describes an experi-
mental study on a series of artificial data sets
to explore the effect of oversampling and under-
sampling and oversampling or undersampling at
different rates. This study suggests a mixture-
of-experts scheme which is described in Section
3. Section 4 discusses the experiment conducted
with that mixture-of-experts scheme on a series
of text-classification tasks and discusses their re-
sults. Section 5 is the conclusion.
2 Experimental Study
We begin this work by studying the effects of
oversampling versus undersampling and over-
sampling or undersampling at different rates.2 All
the experiments in this part of the paper are con-
ducted over artificial data sets defined over the do-
main of 4 x 7 DNF expressions, where the first
number represents the number of literals present
in each disjunct and the second number represents
the number of disjuncts in each concept.3 We
used an alphabet of size 50. For each concept,
we created a training set containing 240 positive
and 6000 negative examples. In other words, we
2Throughout this work, we consider a fixed imbalance
ratio, a fixed number of training examples and a fixed degree
of concept complexity. A thorough study relating different
degrees of imbalance ratios, training set sizes and concept
difficulty was previously reported in (Japkowicz, 2000).
3DNF expressions were specifically chosen because of
their simplicity as well as their similarity to text data whose
classification accuracy we are ultimately interested in im-
proving. In particular, like in the case of text-classification,
DNF concepts of interest are, generally, represented by
much fewer examples than there are counter-examples of
these concepts, especially when 1) the concept at hand is
fairly specific; 2) the number of disjuncts and literals per dis-
junct grows larger; and 3) the values assumed by the literals
are drawn from a large alphabet. Furthermore, an impor-
tant aspect of concept complexity can be expressed in sim-
ilar ways in DNF and textual concepts since adding a new
subtopic to a textual concept corresponds to adding a new
disjunct to a DNF concept.
0
10
20
30
40
50
60
Error over Positive Data          Error over Negative Data
Error Rate (%)
Imbalanced 
Re−Sampling
Down−Sizing
Figure 1: Re-Sampling versus Downsizing
created an imbalance ratio of 1:25 in favor of the
negative class.
2.1 Re-Sampling versus Downsizing
In this part of our study, three sets of experiments
were conducted. First, we trained and tested C5.0
on the 4x7 DNF 1:25 imbalanced data sets just
mentioned.4 Second, we randomly oversampled
the positive class, until its size reached the size
of the negative class, i.e., 6000 examples. The
added examples were straight copies of the data
in the original positive class, with no noise added.
Finally, we undersampled the negative class by
randomly eliminating data points from the neg-
ative class until it reached the size of the positive
class or, 240 data points. Here again, we used
a straightforward random approach for selecting
the points to be eliminated. Each experiment was
repeated 50 times on different 4x7 DNF concepts
and using different oversampled or removed ex-
amples. After each training session, C5.0 was
tested on separate testing sets containing 1,200
positive and 1,200 negative examples. The aver-
age accuracy results are reported in Figure 1. The
left side of Figure 1 shows the results obtained on
the positive testing set while its right side shows
the results obtained on the negative testing set.
As can be expected, the results show that the
number of false negatives (results over the pos-
4(Estabrooks, 2000) reports results on 4 other concept
sizes. An imbalanced ratio of 1:5 was also tried in prelim-
inary experiments and caused a loss of accuracy about as
large as the 1:25 ratio. Imbalanced ratios greater than 1:25
were not tried on this particular problem since we did not
want to confuse the imbalance problem with the small sam-
ple problem.
itive class) is a lot higher than the number of
false positives (results over the negative class). As
well, the results suggest that both naive oversam-
pling and undersampling are helpful for reducing
the error caused by the class imbalance on this
problem although oversampling appears more ac-
curate than undersampling.5
2.2. Re-Sampling and Down-Sizing at various
Rates
In order to find out what happens when different
sampling rates are used, we continued using the
imbalanced data sets of the previous section, but
rather than simply oversampling and undersam-
pling them by equalizing the size of the positive
and the negative set, we oversampled and under-
sampled them at different rates. In particular, we
divided the difference between the size of the pos-
itive and negative training sets by 10 and used this
value as an increment in our oversampling and un-
dersampling experiments. We chose to make the
100% oversampling rate correspond to the fully
oversampled data sets of the previous section but
to make the 90% undersampled rate correspond to
the fully undersampled data sets of the previous
section.6 For example, data sets with a 10% over-
sampling rate containa0a2a1a4a3a6a5a8a7a10a9a12a11a13a3a14a3a14a3a16a15a17a0a2a1a4a3a19a18a21a20a12a22a23a3a25a24
a26a27a22a28a9 positive examples and 6,000 negative exam-
ples. Conversely, data sets with a 0% under-
sampling rate contain 240 positive examples and
6,000 negative ones while data sets with a 10%
undersampling rate contain 240 positive examples
anda9a12a11a13a3a14a3a14a3a29a15a30a7a10a9a12a11a13a3a14a3a14a3a31a15a32a0a2a1a4a3a19a18a21a20a12a22a23a3a33a24a35a34a2a1a19a0a2a1 negative
examples. A 0% oversampling rate and a 90%
undersampling rate correspond to the fully imbal-
anced data sets designed in the previous section
while a 100% undersampling rate corresponds to
the case where no negative examples are present
in the training set.
Once again, and for each oversampling and un-
dersampling rate, the rules learned by C5.0 on the
training sets were tested on testing sets contain-
ing 1,200 positive and 1,200 negative examples.
5Note that the usefulness of oversampling versus under-
sampling is problem dependent. (Domingos, 1999), for ex-
ample, finds that in some experiments, oversampling is more
effective than undersampling, although in many cases, the
opposite can be observed.
6This was done so that no classifier was duplicated in our
combination scheme. (See Section 3)
The results of our experiments are displayed in
Figure 2 for the case of oversampling and under-
sampling respectively. They represent the aver-
ages of 50 trials. Again, the results are reported
separately for the positive and the negative testing
sets.
0 10 20 30 40 50 60 70 80 90 1000
10
20
30
40
50
60
70
80
90
100
Downsizing/Oversampling Rate (%)
Error Rate (%)
Downsize −− Positive Error 
Downsize −− Negative Error 
Re−Sample −− Positive Error
Re−Sample −− Negative Error
Cardinal
Balances 
Figure 2: Oversampling and Downsizing at Dif-
ferent Rates
These results suggest that different sampling
rates have different effects on the accuracy of
C5.0 on imbalanced data sets for both the over-
sampling and the undersampling method. In par-
ticular, the following observation can be made:
Oversampling or undersampling until a
cardinal balance of the two classes is
reached is not necessarily the best strat-
egy: best accuracies are reached before
the two sets are cardinally balanced.
In more detail, this observation comes from the
fact that in both the oversampling and under-
sampling curves of figure 2 the optimal accuracy
is not obtained when the positive and the neg-
ative classes have the same size. In the over-
sampling curves, where class equality is reached
at the 100% oversampling rate, the average er-
ror rate obtained on the data sets over the posi-
tive class at that point is 35.3% (it is of 0.45%
over the negative class) whereas the optimal error
rate is obtained at a sampling rate of 70% (with
an error rate of 22.23% over the positive class
and of 0.56% over the negative class). Similarly,
although less significantly, in the undersampling
curves, where class equality is reached at the 90%
undersampling rate7, the average error rate ob-
7The sharp increase in error rate taking place at the 100%
tained at that point is worse than the one obtained
at a sampling rate of 80% since although the error
rate is the same over the positive class (at 38.72%)
it went from 1.84% at 90% oversampling over the
negative class to 7.93%.8
In general, it is quite likely that the optimal
sampling rates can vary in a way that might not
be predictable for various approaches and prob-
lems.
3 The Mixture-of-Experts Scheme
The results obtained in the previous section sug-
gest that it might be useful to combine oversam-
pling and undersampling versions of C5.0 sam-
pled at different rates. On the one hand, the com-
bination of the oversampling and undersampling
strategies may be useful given the fact that the
two approaches are both useful in the presence of
imbalanced data sets (cf. results of Section 2.1)
and may learn a same concept in different ways.9
On the other hand, the combination of classifiers
using different oversampling and undersampling
rates may be useful since we may not be able to
predict, in advance, which rate is optimal (cf. re-
sults of Section 2.2).
We will now describe the combination scheme
we designed to deal with the class imbalance
problem. This combination scheme will be tested
on a subset of the REUTERS-21578 text classifi-
cation domain.10
3.1 Architecture
A combination scheme for inductive learning
consists of two parts. On the one hand, we must
decide which classifiers will be combined and on
the other hand, we must decide how these classi-
fiers will be combined. We begin our discussion
with a description of the architecture of our mix-
ture of experts scheme. This discussion explains
undersampling point is caused by the fact that at this point,
no negative examples are present in the training set.
8Further results illustrating this point over different con-
cept sizes can also be found in (Estabrooks, 2000).
9In fact, further results comparing C5.0’s rule sizes in
each case suggest that the two methods, indeed, do tackle
the problem differently [see, (Estabrooks, 2000)].
10This combination scheme was first tested on DNF ar-
tificial domains and improved classification accuracy by 52
to 62% over the positive data and decreased the classifica-
tion accuracy by only 7.5 to 13.1% over the negative class
as compared to the accuracy of a single C5.0 classifier. See
(Estabrooks, 2000) for more detail.
which classifiers are combined and gives a gen-
eral idea of how they are combined. The specifics
of our combination scheme are motivated and ex-
plained in the subsequent section.
In order for a combination method to be effec-
tive, it is necessary for the various classifiers that
constitute the combination to make different de-
cisions (Hansen, 1990). The experiments in Sec-
tion 2 of this paper suggest that undersampling
and oversampling at different rates will produce
classifiers able to make different decisions, in-
cluding some corresponding to the “optimal” un-
dersampling or oversampling rates that could not
have been predicted in advance. This suggests
a 3-level hierarchical combination approach con-
sisting of the output level, which combines the re-
sults of the oversampling and undersampling ex-
perts located at the expert level, which themselves
each combine the results of 10 classifiers located
at the classifier level and trained on data sets sam-
pled at different rates. In particular, the 10 over-
sampling classifiers oversample the data at rates
10%, 20%, ... 100% (the positive class is over-
sampled until the two classes are of the same size)
and the 10 undersampling classifiers undersam-
ple the negative class at rate 0%, 10%, ..., 90%
(the negative class is undersampled until the two
classes are of the same size). Figure 3 illustrates
the architecture of this combination scheme that
was motivated by (Shimshoni & Intrator, 1998)’s
Integrated Classification Machine.11
3.2 Detailed Combination Scheme
Our combination scheme is based on two different
facts:
Fact #1: Within a single testing set, different
testing points could be best classified by dif-
ferent single classifiers. (This is a general
fact that can be true for any problem and any
set of classifiers).
Fact #2: In class imbalanced domains for which
the positive training set is small and the neg-
ative training set is large, classifiers tend to
make many false-negative errors. (This is
11However, (Shimshoni & Intrator, 1998) is a general ar-
chitecture. It was not tuned to the imbalance problem, nor
did it take into consideration the use of oversampling and
undersampling to inject principled variance into the differ-
ent classifiers.
Figure 3: Re-Sampling versus Downsizing
a well-known fact often reported in the lit-
erature on the class-imbalance problem and
which was illustrated in Figure 1, above).
In order to deal with the first fact, we decided
not to average the outcome of different classi-
fiers by letting them vote on a given testing point,
but rather to let a single “good enough” classi-
fier make a decision on that point. The classi-
fier selected for a single data point needs not be
the same as the one selected for a different data
point. In general, letting a single, rather than sev-
eral classifiers decide on a data point is based on
the assumption that the instance space may be di-
vided into non-overlapping areas, each best clas-
sified by a different expert. In such a case, av-
eraging the result of different classifiers may not
yield the best solution. We, thus, created a com-
bination scheme that allowed single but different
classifiers to make a decision for each point.
Of course, such an approach is dangerous given
that if the single classifier chosen to make a deci-
sion on a data point is not reliable, the result for
this data point has a good chance of being unreli-
able as well. In order to prevent such a problem,
we designed an elimination procedure geared at
preventing any unfit classifier present at our ar-
chitecture’s classification level from participating
in the decision-making process. This elimination
program relies on our second fact in that it invali-
dates any classifier labeling too many examples as
positive. Since the classifiers of the combination
scheme have a tendency of being naturally biased
towards classifying the examples as negative, we
assume that a classifier making too many positive
decision is probably doing so unreliably.
In more detail, our combination scheme con-
sists of
a36 a combination scheme applied to each expert
at the expert level
a36 a combination scheme applied at the output
level
a36 an elimination scheme applied to the classi-
fier level
The expert and output level combination
schemes use the same very simple heuristic: if
one of the non-eliminated classifiers decides that
an example is positive, so does the expert to which
this classifier belongs. Similarly, if one of the two
experts decides (based on its classifiers’ decision)
that an example is positive, so does the output
level, and thus, the example is classified as pos-
itive by the overall system.
The elimination scheme used at the classifier
level uses the following heuristic: the first (most
imbalanced) and the last (most balanced) classi-
fiers of each expert are tested on an unlabeled
data set. The number of positive classifications
each classifier makes on the unlabeled data set is
recorded and averaged and this average is taken
as the threshold that none of the expert’s classi-
fiers must cross. In other words, any classifier
that classifies more unlabeled data points as pos-
itive than the threshold established for the expert
to which this classifier belongs needs to be dis-
carded.12
It is important to note that, at the expert and
output level, our combination scheme is heav-
ily biased towards the positive under-represented
class. This was done as a way to compensate
for the natural bias against the positive class em-
bodied by the individual classifiers trained on the
class imbalanced domain. This heavy positive
bias, however, is mitigated by our elimination
12Because no labels are present, this technique constitutes
an educated guess of what an appropriate threshold should
be. This heuristic was tested in (Estabrooks, 2000) on the
text classification task discussed below and was shown to
improve the system (over the combination scheme not using
this heuristic) by 3.2% when measured according to thea37a39a38
measure, 0.36% when measured according to the a37a41a40 mea-
sure, and 5.73% when measured according to thea37a27a42a44a43a45 mea-
sure. See the next section, for a definition of the a37a41a46 mea-
sures, but note that the higher thea37 a46 value, the better.
scheme which strenuously eliminates any classi-
fier believed to be too biased towards the positive
class.
4 Experiments on a Text Classification
Task
Our combination scheme was tested on a subset
of the 10 top categories of the REUTERS-21578
Data Set. We first present an overview of the data,
followed by the results obtained by our scheme on
these data.
4.1 The Reuters-21578 Data
The ten largest categories of the Reuters-21578
data set consist of the documents included in the
classes of financial topics listed in Table 1:
Class. Document Count
Earn 3987
ACQ 2448
MoneyFx 801
Grain 628
Crude 634
Trade 551
Interest 513
Wheat 306
Ship 305
Corn 254
Table 1: The top 10 Reuters-21578 categories
Several typical pre-processing steps were taken
to prepare the data for classification. First, the
data was divided according to the ModApte split
which consists of considering all labelled docu-
ments published before 04/07/87 as training data
(9603 documents, altogether) and all labelled
documents published on or after 04/07/87 as test-
ing data (3299 documents altogether). The unla-
belled documents represent 8676 documents and
were used during the classifier elimination step.
Second, the documents were transformed into
feature vectors in several steps. Specifically, all
the punctuation and numbers were removed and
the documents were filtered through a stop word
list13. The words in each document were then
13The stop word list was obtained at:
http://www.dcs.gla.ac.uk/idom/it resources/
linguistic utils/stop-words.
stemmed using the Lovins stemmer14 and the 500
most frequently occurring features were used as
the dictionary for the bag-of-word vectors repre-
senting each documents.15 Finally, the data set
was divided into 10 concept learning problems
where each problem consisted of a positive class
containing 100 examples sampled from a single
top 10 Reuters topic class and a negative class
containing the union of all the examples con-
tained in the other 9 top 10 Reuters classes. Di-
viding the Reuters multi-class data set into a se-
ries of two-class problems is typically done be-
cause considering the problem as a straight mul-
ticlass classification problem causes difficulties
due to the high class overlapping rate of the doc-
uments, i.e., it is not uncommon for a document
to belong to several classes simultaneously. Fur-
thermore, although the Reuters Data set contains
more than 100 examples in each of its top 10 cat-
egories (see Table 1), we found it more realistic
to use a restricted number of positive examples.16
Having restricted the number of positive exam-
ples in each problem, it is interesting to note that
the class imbalances in these problems is very
high since it ranges from an imbalance ratio of
1:60 to one of 1:100 in favour of the negative
class.
4.2 Results
The results obtained by our scheme on these data
were pitted against those of C5.0 ran with the
Ada-boost option.17 The results of these exper-
14The Lovins stemmer was obtained from:
ftp://n106.isitokushima-u.ac.ip/pub/IR/Iterated-Lovins-
stemmer
15A dictionary of 500 words is smaller than the typical
number of words used (see, for example, (Scott & Matwin
1999)), however, it was shown that this restricted size did
not affect the results too negatively while it did reduce pro-
cessing time quite significantly (see (Estabrooks 2000)).
16Indeed, very often in practical situations, we only have
access to a small number of articles labeled “of interest”
whereas huge number of documents “of no interest” are
available
17Our scheme was compared to C5.0 ran with the Ada-
boost option combining 20 classifiers. This was done in
order to present a fair comparison to our approach which
also uses 20 classifiers. It turns out, however, that the Ada-
boost option provided only a marginal improvement over us-
ing a single version of C5.0 (which itself compares favorably
to state-of-the-art approaches for this problem) (Estabrooks,
2000). Please, note that other experiments using C5.0 with
the Ada-boost option combining fewer or more classifiers
should be attempted as well since 20 classifiers might not be
iments are reported in Figure 4 as a function of
the micro-averaged (over the 10 different classi-
fication problems) a47a49a48 , a47a51a50 and a47a51a52a54a53a55 measures. In
more detail, thea47a57a56 -measure is defined as:
a47 a56
a24a59a58a56
a40a13a60
a48a62a61a44a63a12a64a49a63a27a65
a56
a40
a63a12a64
a60
a65
where P represents precision, and R, recall, which
are respectively defined as follows:
a66 a24 a67a39a68a70a69a72a71
a64a74a73a21a75a77a76a79a78a80a76a79a81
a71
a75
a67a39a68a70a69a72a71
a64a74a73a21a75a77a76a79a78a80a76a79a81
a71
a75
a60a83a82a74a84a86a85
a75
a71
a64a74a73a21a75a87a76a88a78a80a76a79a81
a71
a75
a89a91a90
a67a92a68a44a69a93a71
a64a74a73a21a75a87a76a88a78a80a76a79a81
a71
a75
a67a39a68a70a69a72a71
a64a74a73a70a75a87a76a79a78a80a76a79a81
a71
a75
a60a83a82a74a84a94a85
a75
a71a70a95a16a71a87a96
a84
a78a80a76a79a81
a71
a75
In other words, precision corresponds to the pro-
portion of examples classified as positive that are
truly positive; recall corresponds to the proportion
of truly positive examples that are classified as
positive; thea47a57a56 -measure combines the precision
and recall by a ratio specified by a97 . If a97 a24a98a22 ,
then precision and recall are considered as being
of equal importance. Ifa97 a24a99a0 , then recall is con-
sidered to be twice as important as precision. If
a97
a24a99a3a27a100a101a34 , then precision is considered to be twice
as important as recall.
Because 10 different results are obtained for
each value of B and each combination system
(1 result per classification problem), these results
had to be averaged in order to be presented in the
graph of Figure 4. We used the Micro-averaging
technique which consists of a straight average of
the F-Measures obtained in all the problems, by
each combination system, and for each value of B.
Using Micro-averaging has the advantage of giv-
ing each problem the same weight, independently
of the number of positive examples they contain.
The results in Figure 4 show that our combi-
nation scheme is much more effective than Ada-
boost on both recall and precision. Indeed, Ada-
boost gets an a47a6a48 measure of 52.3% on the data
set while our combination scheme gets an a47a49a48
measure of 72.25%. If recall is considered as
twice more important than precision, the results
are even better. Indeed, the mixture-of-experts
scheme gets ana47a51a50 -measure of 75.9% while Ada-
boost obtains an a47 a50 -measure of 48.5%. On the
other hand, if precision is considered as twice
more important than recall, then the combina-
tion scheme is still effective, but not as effective
C5.0-Ada-boost’s optimal number on our problem.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
AdaBoost versus the Mixture−of−Experts Scheme
F−Measure
AdaBoost 
OurScheme
F−1 F−2 F−0.5 
Figure 4: Average results obtained by Ada-Boost
and the Mixture-of-Experts scheme on 10 text
classification problems
with respect to Ada-boost since it brings thea47a102a52a54a53a55 -
measure on the reduced data set to only 73.61%,
whereas Ada-Boost’s performance amounts to
64.9%.
The generally better performance displayed by
our proposed system when evaluated using the
a47a51a50 -measure and its generally worse performance
when evaluated using the a47a51a52a54a53a55 -measure are not
surprising, since we biased our system so that it
classifies more data points as positive. In other
words, it is expected that our system will cor-
rectly discover new positive examples that were
not discovered by Ada-Boost, but will incorrectly
label as positive examples that are not positive.
Overall, however, the results of our approach are
quite positive with respect to both precision and
recall. Furthermore, it is important to note that
this method is not particularly computationally
intensive. In particular, its computation costs are
comparable to those of commonly used combina-
tion methods, such as AdaBoost.
5 Conclusion and Future Work
This paper presented an approach for dealing
with the class-imbalance problem that consisted
of combining different expressions of re-sampling
based classifiers in an informed fashion. In par-
ticular, our combination system was built so as to
bias the classifiers towards the positive set so as
counteract the negative bias typically developed
by classifiers facing a higher proportion of nega-
tive than positive examples. The positive bias we
included was carefully regulated by an elimina-
tion strategy designed to prevent unreliable clas-
sifiers to participate in the process. The technique
was shown to be very effective on a drastically
imbalanced version of a subset of the REUTERS
text classification task.
There are different ways in which this study
could be expanded in the future. First, our tech-
nique was used in the context of a very naive over-
sampling and undersampling scheme. It would be
useful to apply our scheme to more sophisticated
re-sampling approaches such as those of (Lewis &
Gale, 1994) and (Kubat & Matwin, 1997). Sec-
ond, it would be interesting to find out whether
our combination approach could also improve on
cost-sensitive techniques previously designed. Fi-
nally, we would like to test our technique on other
domains presenting a large class imbalance.
Acknowledgements
We would like to thank Rob Holte and Chris
Drummond for their valuable comments. This re-
search was funded in part by an NSERC grant.
The work described in this paper was conducted
at Dalhousie University.

References
Blum, A. and Mitchell, T. (1998): Combining Labeled
and Unlabeled Data with Co-Training Proceedings
of the 1998 Conference on Computational Learning
Theory.
Domingos, Pedro (1999): Metacost: A general
method for making classifiers cost sensitive, Pro-
ceedings of the Fifth International Conference on
Knowledge Discovery and Data Mining, 155–164.
Estabrooks, Andrew (2000): A Combination Scheme
for Inductive Learning from Imbalanced Data Sets,
MCS Thesis, Faculty of Computer Science, Dal-
housie University.
Hansen, L. K. and Salamon, P. (1990): Neural Net-
work Ensembles, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 12(10), 993–
1001.
Japkowicz, Nathalie (2000): The Class Imbalance
Problem: Significance and Strategies, Proceedings
of the 2000 International Conference on Artificial
Intelligence (IC-AI’2000), 111–117.
Joachims, T. (1998): Text Categorization with Support
Vector Machines: Learning with many relevant fea-
tures, Proceedings of the 1998 European Confer-
ence on Machine Learning.
Kubat, Miroslav and Matwin, Stan (1997): Address-
ing the Curse of Imbalanced Data Sets: One-Sided
Sampling, Proceedings of the Fourteenth Interna-
tional Conference on Machine Learning, 179–186.
Lewis, D. and Gale, W. (1994): Training Text Classi-
fiers by Uncertainty Sampling, Proceedings of the
Seventh Annual International ACM SIGIR Confer-
ence on Research and Development in Information
Retrieval.
Scott, Sam and Matwin, Stan (1999): Feature En-
gineering for Text Classification, Proceedings of
the Sixteenth International Conference on Machine
Learning, 379–388.
Shimshoni, Y. and Intrator, N. (1998): Classifying
Seismic Signals by Integrating Ensembles of Neural
Networks, IEEE Transactions On Signal Process-
ing, Special issue on NN.
