A Mixture-of-Experts Framework for Text Classification

2 Experimental Study

We begin this work by studying the effects of oversampling versus undersampling, and of oversampling or undersampling at different rates.[2] All the experiments in this part of the paper are conducted on artificial data sets defined over the domain of 4x7 DNF expressions, where the first number represents the number of literals present in each disjunct and the second the number of disjuncts in each concept.[3] We used an alphabet of size 50. For each concept, we created a training set containing 240 positive and 6,000 negative examples; in other words, we created an imbalance ratio of 1:25 in favor of the negative class.

[2] Throughout this work, we consider a fixed imbalance ratio, a fixed number of training examples and a fixed degree of concept complexity. A thorough study relating different degrees of imbalance, training-set sizes and concept difficulty was previously reported in (Japkowicz, 2000).

[3] DNF expressions were chosen because of their simplicity as well as their similarity to the text data whose classification accuracy we are ultimately interested in improving. In particular, as in text classification, DNF concepts of interest are generally represented by far fewer examples than there are counter-examples, especially when 1) the concept at hand is fairly specific; 2) the number of disjuncts and of literals per disjunct grows large; and 3) the values assumed by the literals are drawn from a large alphabet. Furthermore, an important aspect of concept complexity can be expressed in similar ways in DNF and textual concepts, since adding a new subtopic to a textual concept corresponds to adding a new disjunct to a DNF concept.

2.1 Re-Sampling versus Downsizing

In this part of our study, three sets of experiments were conducted. First, we trained and tested C5.0 on the 4x7 DNF 1:25 imbalanced data sets just described.[4] Second, we randomly oversampled the positive class until its size reached that of the negative class, i.e., 6,000 examples; the added examples were straight copies of the data in the original positive class, with no noise added. Finally, we undersampled the negative class by randomly eliminating data points until it reached the size of the positive class, i.e., 240 data points; here again, we used a straightforward random approach to select the points to be eliminated. Each experiment was repeated 50 times on different 4x7 DNF concepts and with different oversampled or removed examples. After each training session, C5.0 was tested on separate testing sets containing 1,200 positive and 1,200 negative examples. The average accuracy results are reported in Figure 1: its left side shows the results obtained on the positive testing set, and its right side the results obtained on the negative testing set.

[4] (Estabrooks, 2000) reports results on 4 other concept sizes. An imbalance ratio of 1:5 was also tried in preliminary experiments and caused a loss of accuracy about as large as that caused by the 1:25 ratio. Imbalance ratios greater than 1:25 were not tried on this particular problem because we did not want to confound the imbalance problem with the small-sample problem.
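To make the resampling procedures concrete, here is a minimal Python sketch of the naive random oversampling and undersampling described above. It is a sketch under assumptions: the example pools are placeholder stubs rather than the paper's DNF generator, the function names are ours, and the C5.0 training step is not shown.

    import random

    def oversample(positives, target_size, rng=random):
        # Add straight copies of randomly chosen positive examples,
        # with no noise, until the class reaches target_size.
        copies = [rng.choice(positives) for _ in range(target_size - len(positives))]
        return positives + copies

    def undersample(negatives, target_size, rng=random):
        # Randomly eliminate data points until only target_size remain.
        return rng.sample(negatives, target_size)

    # Placeholder stubs standing in for the 4x7 DNF examples.
    positives = [("pos", i) for i in range(240)]
    negatives = [("neg", i) for i in range(6000)]

    oversampled = oversample(positives, len(negatives)) + negatives    # 6,000 vs. 6,000
    undersampled = positives + undersample(negatives, len(positives))  # 240 vs. 240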
As expected, the results show that the number of false negatives (errors on the positive testing set) is much higher than the number of false positives (errors on the negative testing set). The results also suggest that both naive oversampling and undersampling help reduce the error caused by the class imbalance on this problem, although oversampling appears more accurate than undersampling.[5]

[5] Note that the usefulness of oversampling versus undersampling is problem dependent. (Domingos, 1999), for example, finds that in some experiments oversampling is more effective than undersampling, although in many cases the opposite can be observed.

2.2 Re-Sampling and Down-Sizing at Various Rates

In order to find out what happens when different sampling rates are used, we continued using the imbalanced data sets of the previous section, but rather than oversampling and undersampling them until the positive and negative sets were of equal size, we oversampled and undersampled them at different rates. In particular, we divided the difference between the sizes of the positive and negative training sets by 10 and used this value as an increment in our oversampling and undersampling experiments. We chose to make the 100% oversampling rate correspond to the fully oversampled data sets of the previous section, but to make the 90% undersampling rate correspond to the fully undersampled data sets of the previous section.[6] For example, data sets with a 10% oversampling rate contain 240 + (6,000 - 240) x 10% = 816 positive examples and 6,000 negative examples. Conversely, data sets with a 0% undersampling rate contain 240 positive examples and 6,000 negative ones, while data sets with a 10% undersampling rate contain 240 positive examples and 6,000 - (6,000 - 240) x 10% = 5,424 negative examples. A 0% oversampling rate and a 0% undersampling rate correspond to the fully imbalanced data sets designed in the previous section, while a 100% undersampling rate corresponds to the case where no negative examples are present in the training set.

[6] This was done so that no classifier was duplicated in our combination scheme (see Section 3).

Once again, for each oversampling and undersampling rate, the rules learned by C5.0 on the training sets were tested on testing sets containing 1,200 positive and 1,200 negative examples.
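As a quick check on the arithmetic of this rate scheme, the sketch below computes the class sizes implied by the one-tenth-of-the-gap increment; the two worked examples from the text fall out as assertions. The function names are ours, and the special handling of the undersampling endpoints (90% pinned to the fully undersampled sets, 100% to an empty negative set) is taken as given rather than modeled.

    POS, NEG = 240, 6000
    STEP = (NEG - POS) // 10   # 576 examples per 10% of sampling rate

    def oversampled_positives(rate_pct):
        # Positive-class size after oversampling at the given rate;
        # at 100% the two classes are balanced.
        return POS + STEP * rate_pct // 10

    def undersampled_negatives(rate_pct):
        # Negative-class size after undersampling at the given rate.
        # The paper pins the endpoints specially (90% = fully under-
        # sampled, 100% = no negatives), so this linear formula is
        # only meant for the worked example below.
        return NEG - STEP * rate_pct // 10

    assert oversampled_positives(10) == 816      # worked example from the text
    assert oversampled_positives(100) == NEG     # balanced at 100%
    assert undersampled_negatives(10) == 5424    # worked example from the text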
The results of our experiments are displayed in Figure 2, for oversampling and undersampling respectively. They represent the averages of 50 trials and, again, are reported separately for the positive and the negative testing sets.

[Figure 2: C5.0's results when oversampling and undersampling at different rates.]

These results suggest that different sampling rates have different effects on the accuracy of C5.0 on imbalanced data sets, for both the oversampling and the undersampling methods. In particular, the following observation can be made: oversampling or undersampling until the two classes reach the same cardinality is not necessarily the best strategy, since the best accuracies are reached before the two sets are cardinally balanced.

In more detail, this observation comes from the fact that, in both the oversampling and the undersampling curves of Figure 2, the optimal accuracy is not obtained where the positive and the negative classes have the same size. In the oversampling curves, where class equality is reached at the 100% oversampling rate, the average error rate over the positive class at that point is 35.3% (it is 0.45% over the negative class), whereas the optimal error rate is obtained at a sampling rate of 70% (with an error rate of 22.23% over the positive class and 0.56% over the negative class). Similarly, although less markedly, in the undersampling curves, where class equality is reached at the 90% undersampling rate,[7] the average error rate obtained at that point is worse than the one obtained at a sampling rate of 80%: although the error rate over the positive class is the same at both rates (38.72%), the error rate over the negative class went from 1.84% at the 80% rate to 7.93% at the 90% rate.[8] In general, it is quite likely that the optimal sampling rates vary, in ways that may not be predictable, across approaches and problems.

[7] The sharp increase in error rate taking place at the 100% undersampling rate corresponds to the training sets in which no negative examples remain.
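The comparison just made can be replayed numerically. The sketch below takes the per-class error rates quoted above and, since the test sets are balanced (1,200 examples per class), averages them with equal weight to obtain an overall error per rate; only the rates the text reports are included, so "best" means best among those. The variable names and the equal-weight averaging are ours.

    # Per-class error rates (%) quoted in the text, keyed by sampling rate.
    oversampling = {70: (22.23, 0.56), 100: (35.30, 0.45)}
    undersampling = {80: (38.72, 1.84), 90: (38.72, 7.93)}

    def overall_error(pos_err, neg_err):
        # The test sets hold 1,200 examples of each class, so the
        # overall error is the plain average of the two class errors.
        return (pos_err + neg_err) / 2

    for name, curves in (("oversampling", oversampling),
                         ("undersampling", undersampling)):
        best_rate = min(curves, key=lambda r: overall_error(*curves[r]))
        print(f"{name}: best reported rate is {best_rate}%")
        # -> 70% and 80%: the optimum precedes full class balance.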