<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0709">
  <Title>A Mixture-of-Experts Framework for Text Classification</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Mixture-of-Experts Scheme
</SectionTitle>
    <Paragraph position="0"> The results obtained in the previous section suggest that it might be useful to combine oversampling and undersampling versions of C5.0 sampled at different rates. On the one hand, combining the oversampling and undersampling strategies may be useful because both approaches help in the presence of imbalanced data sets (cf. the results of Section 2.1) and may learn the same concept in different ways.9 On the other hand, combining classifiers that use different oversampling and undersampling rates may be useful because we cannot predict, in advance, which rate is optimal (cf. the results of Section 2.2).</Paragraph>
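As a concrete illustration (a minimal sketch, not the authors' code; the function name and the rate conventions are our assumptions), oversampling and undersampling a training set at a given rate might look like this, with `rate=1.0` producing a balanced set under oversampling and an empty negative class under undersampling:

```python
import random

def resample(positives, negatives, rate, mode, seed=0):
    """Illustrative re-sampling at a given rate (conventions are assumptions,
    not the paper's exact definitions).

    mode='over':  duplicate random positive examples, closing `rate` of the
                  size gap to the negative class (rate=1.0 -> equal sizes).
    mode='under': discard a `rate` fraction of the negative examples
                  (rate=1.0 -> no negatives left).
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    if mode == "over":
        target = len(pos) + int(rate * (len(neg) - len(pos)))
        while len(pos) < target:
            pos.append(rng.choice(positives))
    elif mode == "under":
        keep = len(neg) - int(rate * len(neg))
        rng.shuffle(neg)
        neg = neg[:keep]
    return pos, neg
```

Training one classifier per rate on the outputs of such a routine is what produces the pool of differing classifiers that the combination scheme below exploits.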
    <Paragraph position="1"> We will now describe the combination scheme we designed to deal with the class imbalance problem. This combination scheme will be tested on a subset of the REUTERS-21578 text classification domain.10</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Architecture
</SectionTitle>
      <Paragraph position="0"> A combination scheme for inductive learning consists of two parts: on the one hand, we must decide which classifiers will be combined; on the other hand, we must decide how these classifiers will be combined. We begin our discussion with a description of the architecture of our mixture-of-experts scheme. This discussion explains which classifiers are combined and gives a general idea of how they are combined. The specifics of our combination scheme are motivated and explained in the subsequent section.</Paragraph>
      <Paragraph position="1"> undersampling point is caused by the fact that, at this point, no negative examples are present in the training set.</Paragraph>
      <Paragraph position="2"> each case suggest that the two methods, indeed, do tackle the problem differently [see (Estabrooks, 2000)].</Paragraph>
      <Paragraph position="3"> 10This combination scheme was first tested on DNF artificial domains, where it improved classification accuracy by 52 to 62% on the positive class and decreased classification accuracy by only 7.5 to 13.1% on the negative class, as compared to the accuracy of a single C5.0 classifier. See (Estabrooks, 2000) for more detail.</Paragraph>
      <Paragraph position="4"> In order for a combination method to be effective, the various classifiers that constitute the combination must make different decisions (Hansen, 1990). The experiments in Section 2 of this paper suggest that undersampling and oversampling at different rates will produce classifiers able to make different decisions, including some corresponding to the &amp;quot;optimal&amp;quot; undersampling or oversampling rates that could not have been predicted in advance. This suggests a 3-level hierarchical combination approach. The output level combines the results of the oversampling and undersampling experts located at the expert level; these experts each combine the results of 10 classifiers located at the classifier level and trained on data sets sampled at different rates. In particular, the 10 oversampling classifiers oversample the data at rates 10%, 20%, ..., 100% (the positive class is oversampled until the two classes are of the same size) and the 10 undersampling classifiers undersample the negative class at rates 0%, 10%, ..., 90% (the negative class is undersampled until the two classes are of the same size). Figure 3 illustrates the architecture of this combination scheme, which was motivated by (Shimshoni &amp; Intrator, 1998)'s Integrated Classification Machine.11</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Detailed Combination Scheme
</SectionTitle>
      <Paragraph position="0"> Our combination scheme is based on two different facts: Fact #1: Within a single testing set, different testing points could be best classified by different single classifiers. (This is a general fact that can be true for any problem and any set of classifiers).</Paragraph>
      <Paragraph position="1"> Fact #2: In class-imbalanced domains for which the positive training set is small and the negative training set is large, classifiers tend to make many false-negative errors. (This is a well-known fact, often reported in the literature on the class-imbalance problem, and one that was illustrated in Figure 1, above.) 11However, (Shimshoni &amp; Intrator, 1998) is a general architecture. It was not tuned to the imbalance problem, nor did it take into consideration the use of oversampling and undersampling to inject principled variance into the different classifiers.</Paragraph>
      <Paragraph position="2"> In order to deal with the first fact, we decided not to average the outcomes of different classifiers by letting them vote on a given testing point, but rather to let a single &amp;quot;good enough&amp;quot; classifier make the decision on that point. The classifier selected for one data point need not be the same as the one selected for a different data point. In general, letting a single classifier, rather than several, decide on a data point rests on the assumption that the instance space may be divided into non-overlapping areas, each best classified by a different expert. In such a case, averaging the results of different classifiers may not yield the best solution. We thus created a combination scheme that allows a single, but possibly different, classifier to make the decision for each point.</Paragraph>
      <Paragraph position="3"> Of course, such an approach is dangerous: if the single classifier chosen to make a decision on a data point is not reliable, the result for this data point has a good chance of being unreliable as well. In order to prevent such a problem, we designed an elimination procedure geared at preventing any unfit classifier present at our architecture's classifier level from participating in the decision-making process. This elimination procedure relies on our second fact in that it invalidates any classifier labeling too many examples as positive. Since the classifiers of the combination scheme have a natural tendency to classify examples as negative, we assume that a classifier making too many positive decisions is probably doing so unreliably.</Paragraph>
      <Paragraph position="4"> In more detail, our combination scheme consists of: (i) a combination scheme applied to each expert at the expert level; (ii) a combination scheme applied at the output level; and (iii) an elimination scheme applied at the classifier level. The expert- and output-level combination schemes use the same very simple heuristic: if one of the non-eliminated classifiers decides that an example is positive, so does the expert to which this classifier belongs. Similarly, if one of the two experts decides (based on its classifiers' decisions) that an example is positive, so does the output level, and thus the example is classified as positive by the overall system.</Paragraph>
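The two-level, OR-style voting just described can be sketched as follows (a hypothetical rendering; the boolean vote lists and eliminated-index sets are our own representation, not the paper's):

```python
def expert_decision(votes, eliminated):
    """Expert level: positive if any non-eliminated classifier votes positive.
    votes: list of booleans, one per classifier; eliminated: set of indices."""
    return any(v for i, v in enumerate(votes) if i not in eliminated)

def output_decision(over_votes, under_votes,
                    over_elim=frozenset(), under_elim=frozenset()):
    """Output level: positive if either the oversampling expert or the
    undersampling expert decides the example is positive."""
    return (expert_decision(over_votes, over_elim)
            or expert_decision(under_votes, under_elim))
```

Note how a single positive vote anywhere in the surviving pool suffices; this is the deliberate positive bias discussed below.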
      <Paragraph position="5"> The elimination scheme used at the classifier level relies on the following heuristic: the first (most imbalanced) and the last (most balanced) classifiers of each expert are tested on an unlabeled data set. The number of positive classifications each of these two classifiers makes on the unlabeled data set is recorded and averaged, and this average is taken as the threshold that none of the expert's classifiers must cross. In other words, any classifier that classifies more unlabeled data points as positive than the threshold established for its expert is discarded.12 It is important to note that, at the expert and output levels, our combination scheme is heavily biased towards the positive, under-represented class. This was done as a way to compensate for the natural bias against the positive class embodied by the individual classifiers trained on the class-imbalanced domain. This heavy positive bias, however, is mitigated by our elimination scheme, which strenuously eliminates any classifier believed to be too biased towards the positive class. 12Because no labels are present, this technique constitutes an educated guess of what an appropriate threshold should be. This heuristic was tested in (Estabrooks, 2000) on the text classification task discussed below and was shown to improve the system (over the combination scheme not using this heuristic) by 3.2% when measured according to the F1 measure, 0.36% when measured according to the F2 measure, and 5.73% when measured according to the F0.5 measure. See the next section for a definition of the FB measures, but note that the higher the FB value, the better.</Paragraph>
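The threshold heuristic can be sketched as follows (a hypothetical illustration; the input is simply the count of positive labels each of an expert's classifiers assigns to the unlabeled set, ordered from the most-imbalanced to the most-balanced training sample):

```python
def eliminate_unfit(positive_counts):
    """Threshold = average of the positive-label counts of the first (most
    imbalanced) and last (most balanced) classifiers; any classifier whose
    count exceeds this threshold is discarded. Returns the set of discarded
    classifier indices."""
    threshold = (positive_counts[0] + positive_counts[-1]) / 2.0
    return {i for i, c in enumerate(positive_counts) if c > threshold}
```

The returned index set is exactly what the expert-level voting rule above would take as its `eliminated` argument.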
      <Paragraph position="6">  Our combination scheme was tested on a subset of the 10 top categories of the REUTERS-21578 Data Set. We first present an overview of the data, followed by the results obtained by our scheme on these data.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The Reuters-21578 Data
</SectionTitle>
      <Paragraph position="0"> The ten largest categories of the Reuters-21578 data set consist of the documents included in the classes of financial topics listed in Table 1. Several typical pre-processing steps were taken to prepare the data for classification. First, the data was divided according to the ModApte split, which considers all labelled documents published before 04/07/87 as training data (9603 documents altogether) and all labelled documents published on or after 04/07/87 as testing data (3299 documents altogether). The unlabelled documents represent 8676 documents and were used during the classifier elimination step. Second, the documents were transformed into feature vectors in several steps. Specifically, all the punctuation and numbers were removed, the documents were filtered through a stop word list, the remaining words were stemmed using the Lovins stemmer14, and the 500 most frequently occurring features were used as the dictionary for the bag-of-words vectors representing each document.15 Finally, the data set was divided into 10 concept-learning problems, where each problem consisted of a positive class containing 100 examples sampled from a single top-10 Reuters topic class and a negative class containing the union of all the examples contained in the other 9 top-10 Reuters classes. Dividing the Reuters multi-class data set into a series of two-class problems is typically done because treating the problem as a straight multi-class classification problem causes difficulties due to the high class-overlapping rate of the documents, i.e., it is not uncommon for a document to belong to several classes simultaneously.
Furthermore, although the Reuters data set contains more than 100 examples in each of its top 10 categories (see Table 1), we found it more realistic to use a restricted number of positive examples.16 Having restricted the number of positive examples in each problem, it is interesting to note that the class imbalances in these problems are very high, ranging from an imbalance ratio of 1:60 to one of 1:100 in favour of the negative class.</Paragraph>
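The vectorization steps described above can be sketched roughly as follows (a simplified illustration only: stemming is omitted, and the stop-word list is a placeholder supplied by the caller):

```python
import re
from collections import Counter

def build_vectors(docs, stop_words, vocab_size=500):
    """Bag-of-words preparation along the lines described above: strip
    punctuation and digits, remove stop words, keep the vocab_size most
    frequent terms as the dictionary, and count occurrences per document."""
    tokenized = []
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())  # drops punctuation/numbers
        tokenized.append([w for w in words if w not in stop_words])
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for w in doc:
            if w in index:
                vec[index[w]] += 1
        vectors.append(vec)
    return vectors, vocab
```

A one-vs-rest problem is then built per topic by taking that topic's vectors as the positive class and the union of the other nine topics' vectors as the negative class.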
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> The results obtained by our scheme on these data were pitted against those of C5.0 run with the Ada-boost option.17 The results of these experiments are reported in Figure 4 as a function of the micro-averaged (over the 10 different classification problems) F1, F2 and F0.5 measures. In more detail, the FB-measure is defined as: FB = ((1 + B^2) x Precision x Recall) / (B^2 x Precision + Recall). In other words, precision corresponds to the proportion of examples classified as positive that are truly positive; recall corresponds to the proportion of truly positive examples that are classified as positive; the FB-measure combines precision and recall in a ratio specified by B. If B = 1, then precision and recall are considered as being of equal importance. If B = 2, then recall is considered to be twice as important as precision. If B = 0.5, then precision is considered to be twice as important as recall. 15number of words used (see, for example, (Scott &amp; Matwin, 1999)); however, it was shown that this restricted size did not affect the results too negatively, while it did reduce processing time quite significantly (see (Estabrooks, 2000)). 16Indeed, very often in practical situations, we only have access to a small number of articles labeled &amp;quot;of interest&amp;quot;, whereas huge numbers of documents &amp;quot;of no interest&amp;quot; are available. 17Our scheme was compared to C5.0 run with the Ada-boost option combining 20 classifiers. This was done in order to present a fair comparison with our approach, which also uses 20 classifiers. It turns out, however, that the Ada-boost option provided only a marginal improvement over using a single version of C5.0 (which itself compares favorably to state-of-the-art approaches for this problem) (Estabrooks, 2000). Please note that other experiments using C5.0 with the Ada-boost option combining fewer or more classifiers should be attempted as well, since 20 classifiers might not be C5.0-Ada-boost's optimal number on our problem.</Paragraph>
    <Paragraph position="1"> Because 10 different results are obtained for each value of B and each combination system (1 result per classification problem), these results had to be averaged in order to be presented in the graph of Figure 4. We used the micro-averaging technique, which consists of a straight average of the F-measures obtained in all the problems, by each combination system, and for each value of B.</Paragraph>
    <Paragraph position="2"> Using micro-averaging has the advantage of giving each problem the same weight, independently of the number of positive examples it contains.</Paragraph>
    <Paragraph position="3"> The results in Figure 4 show that our combination scheme is much more effective than Ada-boost on both recall and precision. Indeed, Ada-boost gets an F1 measure of 52.3% on the data set, while our combination scheme gets an F1 measure of 72.25%. If recall is considered twice as important as precision, the results are even better: the mixture-of-experts scheme gets an F2-measure of 75.9%, while Ada-boost obtains an F2-measure of 48.5%. On the other hand, if precision is considered twice as important as recall, then the combination scheme is still effective with respect to Ada-boost, though less markedly so: it brings the F0.5-measure on the reduced data set to only 73.61%, whereas Ada-boost's performance amounts to 64.9%.</Paragraph>
    <Paragraph position="4"> Figure 4: Results of Ada-boost and the Mixture-of-Experts scheme on 10 text classification problems.</Paragraph>
    <Paragraph position="5"> The generally better performance displayed by our proposed system when evaluated using the F2-measure and its generally worse performance when evaluated using the F0.5-measure are not surprising, since we biased our system so that it classifies more data points as positive. In other words, it is expected that our system will correctly discover new positive examples that were not discovered by Ada-boost, but will incorrectly label as positive some examples that are not positive.</Paragraph>
    <Paragraph position="6"> Overall, however, the results of our approach are quite positive with respect to both precision and recall. Furthermore, it is important to note that this method is not particularly computationally intensive. In particular, its computation costs are comparable to those of commonly used combination methods, such as AdaBoost.</Paragraph>
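The FB computation and the straight averaging over problems described above can be sketched as follows (a minimal sketch; the `(tp, fp, fn)` triple format is our own convention for a problem's confusion counts):

```python
def f_measure(tp, fp, fn, beta):
    """FB = (1 + B^2) * P * R / (B^2 * P + R), with precision
    P = tp / (tp + fp) and recall R = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def averaged_f(results, beta):
    """Straight average of FB over the classification problems, each problem
    weighted equally; results holds one (tp, fp, fn) triple per problem."""
    scores = [f_measure(tp, fp, fn, beta) for tp, fp, fn in results]
    return sum(scores) / len(scores)
```

With `beta=2` a shortfall in recall is penalized more heavily than one in precision, which is why the positively-biased scheme shines most under the F2 measure.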
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Conclusion and Future Work
</SectionTitle>
    <Paragraph position="0"> This paper presented an approach for dealing with the class-imbalance problem that consists of combining different expressions of re-sampling-based classifiers in an informed fashion. In particular, our combination system was built so as to bias the classifiers towards the positive set, in order to counteract the negative bias typically developed by classifiers facing a higher proportion of negative than positive examples. The positive bias we included was carefully regulated by an elimination strategy designed to prevent unreliable classifiers from participating in the process. The technique was shown to be very effective on a drastically imbalanced version of a subset of the REUTERS text classification task.</Paragraph>
    <Paragraph position="1"> There are different ways in which this study could be expanded in the future. First, our technique was used in the context of a very naive oversampling and undersampling scheme. It would be useful to apply our scheme to more sophisticated re-sampling approaches such as those of (Lewis &amp; Gale, 1994) and (Kubat &amp; Matwin, 1997). Second, it would be interesting to find out whether our combination approach could also improve on cost-sensitive techniques previously designed. Finally, we would like to test our technique on other domains presenting a large class imbalance.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> We would like to thank Rob Holte and Chris Drummond for their valuable comments. This research was funded in part by an NSERC grant.</Paragraph>
    <Paragraph position="1"> The work described in this paper was conducted</Paragraph>
  </Section>
</Paper>