<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2409"> <Title>A Comparison of Manual and Automatic Constructions of Category Hierarchy for Classifying Large Corpora</Title> <Section position="7" start_page="1" end_page="3" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 5.1 Data and Evaluation Methodology </SectionTitle> <Paragraph position="0"> We compare the automatically created hierarchy with a flat (non-hierarchical) model and with a manually constructed hierarchy with respect to classification accuracy. We further evaluate the generated category hierarchy from two perspectives: we examine (i) whether the number of categories affects the construction of a category hierarchy, and (ii) whether a large collection of data helps to generate a category hierarchy.</Paragraph> <Paragraph position="1"> The data we used is the 1996 Reuters corpus, which has recently become available (Reuters, 2000). The corpus, covering 20 Aug. 1996 to 19 Aug. 1997, consists of 806,791 documents. These documents are organized into 126 topical categories arranged in a hierarchy of up to five levels. After eliminating unlabeled documents, we divide these documents into four sets. Table 1 shows the data used for each model, i.e. the flat non-hierarchical model, the manually constructed hierarchy, and the automatically created hierarchy. A shared label (X) in Table 1 denotes a matched pair of training and test data. For example, '(F1) Training data' indicates that 145,919 samples are used for training the NB classifiers, and '(F1) Test data' indicates that 290,665 samples are used for classification. We selected the 102 categories which have at least one document in each data set.</Paragraph> <Paragraph position="2"> We obtained a vocabulary of 320,935 unique words after eliminating words which occur only once, stemming with a part-of-speech tagger (Schmid, 1995), and stop word removal. The number of categories per document is 3.21 on average. For both the hierarchical and non-hierarchical cases, we select the 1,000 features with the largest MI for each of the 102 categories, and create a category vector.</Paragraph> <Paragraph position="3"> Like the method of Roy et al., we use bagging to reduce the variance of the estimate of the true output distribution P(y | x). From our original development training set of 300,000 documents, a different training set consisting of 200,000 documents is created by random sampling. The learner then creates a new NB classifier from this sample. This procedure is repeated 10 times, and the final class posterior for an instance is taken to be the average of the class posteriors of the individual classifiers.</Paragraph>
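<Paragraph> The bagging step just described can be sketched as follows. This is a minimal illustration only: it assumes a (sparse) document-term matrix X, binary labels y for a single category, and scikit-learn's multinomial Naive Bayes as the base learner; the variable names and sizes are illustrative rather than taken from the paper.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def bagged_posteriors(X, y, X_test, n_rounds=10, sample_size=200_000, seed=0):
    """Average NB class posteriors over classifiers trained on random subsamples."""
    rng = np.random.default_rng(seed)
    total = None
    for _ in range(n_rounds):
        # Draw a smaller training set (e.g. 200,000 of the 300,000 documents).
        idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
        clf = MultinomialNB().fit(X[idx], y[idx])
        probs = clf.predict_proba(X_test)
        total = probs if total is None else total + probs
    # The final posterior is the average over the individual classifiers.
    return total / n_rounds
</Paragraph>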
<Paragraph position="4"> For evaluating the effectiveness of category assignments, we use the standard recall, precision, and F-score. Recall is defined as the number of correct assignments made by the system divided by the total number of correct assignments. Precision is the number of correct assignments made by the system divided by the total number of the system's assignments. The F-score, which combines recall (r) and precision (p) with equal weight, is F(r, p) = 2rp / (r + p). We use the micro-averaged F-score, which is computed globally over the n (total number of categories) × m (total number of test documents) binary decisions.</Paragraph>
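<Paragraph> As a concrete illustration of these metrics, the sketch below computes micro-averaged recall, precision, and F-score over the pooled binary decisions; gold and pred are hypothetical stand-ins for the corpus annotations and the system's assignments, represented as sets of (document, category) pairs.

def micro_prf(gold, pred):
    """Micro-averaged recall, precision, and F-score over (doc, category) pairs."""
    tp = len(gold.intersection(pred))        # correct assignments made by the system
    p = tp / len(pred) if pred else 0.0      # precision
    r = tp / len(gold) if gold else 0.0      # recall
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f

# Toy example with three gold and three predicted assignments, two of which agree.
gold = {(1, "Economics"), (1, "Markets"), (2, "Weather")}
pred = {(1, "Economics"), (2, "Weather"), (2, "Markets")}
print(micro_prf(gold, pred))  # (0.667, 0.667, 0.667)
</Paragraph>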
</Section> <Section position="2" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 5.2 Results and Discussion </SectionTitle> <Paragraph position="0"> Table 2 shows the top level of the manually constructed hierarchy. Table 3 shows the portion of the automatically generated hierarchy associated with the categories shown in Table 2. 'x-y-...' is the ID number assigned to each node of the tree. For example, 3-5-1 indicates that the ID numbers of the top, second, and third levels are 3, 5, and 1, respectively. Each node is also associated with a threshold value obtained from the training samples and development test samples.</Paragraph> <Paragraph position="1"> Tables 2 and 3 indicate that the automatically constructed hierarchy has different properties from the manually created hierarchy. When the top-level categories of a hierarchical structure have equally discriminating properties, they are useful for text classification. In the manually constructed hierarchy (Reuters, 2000), there are 25 categories at the top level, while in our method, which is based on corpus statistics, the 4 most frequent categories are selected as discriminative properties, and the other categories are sub-categorised under 'Government/social', except for 'Labour issues' and 'Weather'. In the automatically generated hierarchy, 'Economics' → 'Expenditure' → 'Welfare' and 'Economics' → 'Labour issues' are created, while 'Welfare' and 'Labour issues' belong to the top level in the manually constructed hierarchy. Another interesting feature of our result is that some related categories are merged into one cluster, while in the manual hierarchy they are in different locations. Table 4 illustrates a sample of related categories in the automatically created hierarchy. In Table 4, for example, 'Ec competition/subsidy' is sub-categorised under 'Monopolies/competition'. In a similar way, 'E31', 'E311', 'E143', and 'E132' are classified under 'MCAT', since these categories are related to market news.</Paragraph> <Paragraph position="2"> As just described, thresholds for each level of a hierarchy were established on the training samples and development test samples. We then use these thresholds for text classification, i.e. for each level of a hierarchy, if a test sample exceeds the threshold, we assign the category to the test sample. A test sample can belong to zero, one, or more than one categories.</Paragraph>
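<Paragraph> One way to realise this per-level thresholding is sketched below; the Node structure, the score function, and the traversal are illustrative assumptions rather than the paper's implementation. A document descends into every node whose classifier score exceeds that node's threshold, so it can end up with zero, one, or several categories.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    category: str
    threshold: float                     # tuned on development test samples
    children: List["Node"] = field(default_factory=list)

def assign_categories(doc, root: Node, score: Callable[[object, str], float]) -> List[str]:
    """Assign a document to every category whose score passes its node threshold."""
    assigned = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in node.children:
            # Descend (and assign) only if the classifier's posterior for this
            # category exceeds the threshold established for that level.
            if score(doc, child.category) >= child.threshold:
                assigned.append(child.category)
                frontier.append(child)
    return assigned
</Paragraph>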
<Paragraph position="3"> Table 5 shows the results of classification accuracy.</Paragraph> <Paragraph position="4"> 'Flat' and 'Manual' show the baselines, i.e. the result when all 102 categories are treated as a flat non-hierarchical problem and the result using the manually constructed hierarchy, respectively. 'Automatic' denotes the result of our method. 'miR', 'miP', and 'miF' refer to the micro-averaged recall, precision, and F-score, respectively. Table 5 shows that the overall F-score obtained by our method was 3.9% better than the Flat model and 3.3% better than the Manual model. Both results are statistically significant according to a micro sign test, P-value below .01 (Yang and Liu, 1999). Somewhat surprisingly, there is no significant difference between Flat and Manual, the micro sign test giving a P-value above 0.05. This shows that manually constructing a hierarchy that suits a corpus is a difficult task. The overall F-score of our method is 0.734. Classifying large data with similar categories is a difficult task, so we did not expect exceptionally high accuracy such as that reported for Reuters-21578 (over 0.85 F-score (Yang and Liu, 1999)). Performance on the closed data, i.e. the training samples and development test samples, for 'Flat', 'Manual', and 'Automatic' was 0.705, 0.720, and 0.782, respectively. Therefore, this is a difficult learning task, and generalization to the test set is quite reasonable.</Paragraph> <Paragraph position="5"> Tables 6 and 7 show the results at each hierarchical level for the manually constructed hierarchy and for our method, respectively. 'Clusters' denotes the number of clusters, and 'Categories' refers to the number of categories at each level. The F-score of 'Manual' for the top-level categories is 0.744, and that of our method is 0.919. Both outperform the flat model. However, the performance of both 'Manual' and our method decreases monotonically as the depth from the top level to each node grows, and the overall F-score at the lower levels of the hierarchies is very low. This is because, at the lower levels, more similar or even identical features may be used within the same top-level category. This suggests that we should be able to obtain further efficiency advantages in the hierarchical approach by reducing the number of features which are not useful discriminators at the lower levels of the hierarchies (Koller and Sahami, 1997).</Paragraph> <Paragraph position="6"> Figure 2 shows the classification accuracy using different numbers of categories, i.e. 10, 50, and 102 categories. Each set of 10 and 50 categories is created from the training samples by random sampling. The sampling is repeated 10 times. Each point in Figure 2 is the average performance over the 10 sets.</Paragraph> <Paragraph position="7"> Our method outperforms both the flat model and the manually constructed hierarchy at every point in the graph. As can be seen, the greater the number of categories, the more likely it is that a test sample is incorrectly classified. Specifically, the performance of the flat model using 10 categories was 0.720 F-score, and that using 50 categories was 0.695. This drop in accuracy indicates that the flat model is likely to be difficult to train when there are a large number of classes with a large number of features. In our method, 10 hierarchies are constructed, one for each set of categories.</Paragraph> <Paragraph position="8"> Figure 3 shows the classification accuracy using different sizes of training samples, i.e. 10,000, 100,000, and 145,919 samples. Each set of samples is created by random sampling, except for the set of 145,919 samples. The sampling process is repeated 10 times. The average accuracy across the 10 sets of samples is reported in Figure 3.</Paragraph> <Paragraph position="9"> Performance benefits significantly from larger numbers of training samples. When the number of training samples is 10,000, all methods have poor effectiveness, but they learn rapidly; in particular, the results of the flat model show that it is extremely sensitive to the amount of training data. At every point in the graph, our method with cluster-based generalisations outperforms the other two methods; in particular, the method becomes more attractive when fewer training samples are available.</Paragraph> <Paragraph position="10"> We used the same numbers of development, development test, and test samples, shown in Table 1, in these experiments.</Paragraph> </Section> </Section> </Paper>