<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1026">
  <Title>Manipulating Large Corpora for Text Classification</Title>
  <Section position="4" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data and Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> We evaluated the method using the 1996 Reuters corpus recently made available. The corpus from 20th Aug. to 31st Dec. consists of 279,303 documents. These documents are organized into 126 categories with a four level hierarchy. We selected 102 categories which have at least one document in the training set and the test set. The number of categories in each level is 25 top level, 33 second level, 43 third level, and 1 fourth level, respectively. Table 1 shows the number of documents in each top level category.</Paragraph>
      <Paragraph position="1"> After eliminating unlabelled documents, we obtained 271,171 documents. We divided these documents into two sets: a training set from 20th Aug. to 31th Oct. which consists of 150,939 documents, and test set from 1th Nov. to 31st Dec. which consists of 120,242 documents. We obtained a vocabulary of 183,400 unique words after eliminating words which occur only once, stemming by a part-of-speech tagger(Schmid, 1995), and stop word removal. Figure 3 illustrates the category distribution  We use ten-fold cross validation for learning NB parameters. For evaluating the effectiveness of category assignments, we use the standard recall, precision, and a189a191a190 measures. Recall denotes the ratio of correct assignments by the system divided by the total number of correct assignments. Precision is the ratio of correct assignments by the system divided by the total number of the system's assignments.</Paragraph>
      <Paragraph position="2"> The a189a98a190 measure which combines recall (a65 ) and pre-</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> The result is shown in Table 2.</Paragraph>
      <Paragraph position="1">  'NB', 'SVMs', and 'Manipulating data' denotes the result using Naive Bayes, SVMs classifiers, and our method, respectively. 'miR', 'miP', and 'miF1' refers to the micro-averaged recall, precision, and F1, respectively. 'all' in Table 2 shows the results of all 102 categories. The micro-averaged F1 score of our method in 'all' (0.704) is higher than the NB (0.519) and SVMs scores (0.285). We note that the F1 score of SVMs (0.285) is significantly lower than other models. This is because we could not obtain a classifier to judge the category 'corporate/industrial' in the top level within 10 days using a standard 2.4 GHz Pentium IV PC with 1,500 MB of RAM. We thus eliminated the category and its child categories from the 102 categories. The number of the remaining categories in each level is 24 top, 14 second, 29 third, and 1 fourth level. 'Parts' in Table 2 denotes the results. There is no significant difference between 'all' and 'parts' in our method, as the F1 score of 'all' was 0.704 and 'parts' was 0.700. The F1 of our method in 'parts' is also higher than the NB and SVMs scores.</Paragraph>
      <Paragraph position="2"> Table 3 denotes the amount of training data used to train NB and SVMs in our method and test data judged by each classifier. We can see that our method makes the computation of the SVMs more efficient, since the data trained by SVMs is only 23,243 from 150,939 documents.</Paragraph>
      <Paragraph position="3"> Table 4 illustrates the results of three methods according to each category level. 'Training' in 'Manipulating data' denotes the number of documents used to train SVMs. The overall F1 value of NB, SVMs, and our method for the 25 top-level cate- null gories is 0.693, 0.341, and 0.715, respectively. Classifying large corpora with similar categories is a difficult task, so we did not expect to have exceptionally high accuracy like Reuters-21578 (0.85 F1 score). Performance on the original training set using SVMs is 0.285 and using NB is 0.519, so this is a difficult learning task and generalization to the test set is quite reasonable.</Paragraph>
      <Paragraph position="4"> There is no significant difference between the overall F1 value of the second(0.608) and third level categories(0.606) in our method, while the accuracy of the other methods drops when classifiers select sub-branches, in third level categories. As Dumais et. al. mentioned, the results of our experiment show that performance varies widely across categories. The highest F1 score is 0.864 ('Commodity markets' category), and the lowest is 0.284 ('Economic performance' category).</Paragraph>
      <Paragraph position="5"> The overall F1 values obtained by three methods for the fourth level category ('Annual result') are low. This is because there is only one category in the level, and we thus used all of the training data, 150,939 documents, to learn models.</Paragraph>
      <Paragraph position="6"> The contribution of the hierarchical structure is best explained by looking at the results with and without category hierarchies, as illustrated in Table 5. It is interesting to note that the results of both NB and our method clearly demonstrate that incorporating category hierarchies into the classification method improves performance, whereas hierarchies degraded the performance of SVMs. This shows that the separation of one top level category(C) from the set of the other 24 top level categories is more difficult than separating C from the set of all the other 101 categories in SVMs.</Paragraph>
      <Paragraph position="7"> Table 6 illustrates sample words which have the highest weighted value calculated using formula (3).</Paragraph>
      <Paragraph position="8"> Recall that in SVMs each value of word a61 a100 (1 a158 a99a86a158 a1 ) is calculated using formula (3), and the larger value of a61 a100 is, the more the word a61 a100 features positive examples. Table 6 denotes the results of two binary classifiers. One is a classifier that separates documents assigned the 'Economics' category(positive examples) from documents assigned a set of the other 24 top level categories, i.e. 'hierarchy'. The other is a classifier that separates documents with the 'Economics' category from documents with a set of the other 101 categories, i.e., 'non-hierarchy'. Table 6 shows that in 'Nonhierarchy', words such as 'economic', 'economy' and 'company' which feature the category 'Economics' have a high weighted value, while in 'hierarchy', words such as 'year' and 'month' which do not feature the category have a high weighted value. Further research using various subsets of the top level categories is necessary to fully understand the influence of the hierarchical structure created by  humans.</Paragraph>
      <Paragraph position="9"> Economics Hierarchy access, Ford, Japan, Internet, economy, year, sale, service, month, marketNon-hierarchy economic, economy, industry, ltd., company, Hollywood, business, service, Internet, access  Finally, we compare our results with a well-known technique, a0a7a1a166a3a5a0a7a6a9a8a26a10a12a0 strategies. In the experiment using ensemble, we divided a training set into ten folds for each category level. Once the individual classifiers are trained by SVMs they are used to classify test data. Each classifier votes and the test data is assigned to the category that receives more than 6 votes3. The result is illustrated in Table 7. In Table 7, 'Non-hierarchy' and 'Hierarchy' denotes the result of the 102 categories treated as a flat non-hierarchical problem, and the result using hierarchical structure, respectively. We can find that the result of a0a7a1a166a3a37a0a2a6a9a8a26a10a12a0 with hierarchy(0.704 F1) outperforms the result with non-hierarchy(0.532 F1). A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are a14 a108a135a108a111a195 a65 a14 a167a26a0 and a94 a19a197a196 a0a2a65a57a3a5a0 (Hansen and Salamon, 1990). An accurate classifier is one that has an error rate better than random guessing on new test data. Two classifiers are diverse if they make different errors on new data points. Given our result, it may be safely said, at least regarding the Reuters 1996 corpus, that hierarchical structure is more effective for constructing ensembles, i.e., an ensemble of classifiers which are constructed by the training data with fewer than 30 categories in each level is more  schemes in the experiment.</Paragraph>
      <Paragraph position="10"> score) when we use hierarchical structure. However, the computation of the former is far more efficient than the latter. Furthermore, we see that our method (0.596 F1 score) slightly outperforms a0a7a1a166a3a5a0a7a6a9a8a26a10a12a0 (0.532 F1 score) when the 102 categories are treated as a flat non-hierarchical problem.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>