<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2408"> <Title>Modeling Category Structures with a Kernel Function</Title> <Section position="7" start_page="0" end_page="64" type="evalu"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> Through text categorization experiments, we empirically compare the HP-TOP kernel with the linear kernel and the PLSI-based Fisher kernel. We use the Reuters-21578 dataset2 with the ModApte split (Dumais et al., 1998). In addition, we delete from the ModApte split those texts that have no text body. After the deletion, we obtain 8815 training examples and 3023 test examples. Words that occur fewer than five times in the whole training set are excluded from the original feature set.</Paragraph> <Paragraph position="1"> We do not use all 8815 training examples at once; the size of the actual training data ranges from 1000 to 8000. For each dataset size, experiments are executed 10 times with different training sets. The result is evaluated with F-measures for the 10 most frequent categories (Table 1).</Paragraph> <Paragraph position="2"> The total number of categories is actually 116, but reliable statistics cannot be obtained for the small categories. For this reason, we merge all categories other than the 10 most frequent ones into a single category. The model for negative examples is therefore a mixture of 10 component models (9 of the 10 most frequent categories, plus the new category consisting of the remaining categories).</Paragraph> <Paragraph position="3"> We assume uniform priors for categories as in (Tsuda et al., 2002). 
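The F-measure evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes per-category true-positive, false-positive, and false-negative counts are already available, and the counts in the test are invented.

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; 0.0 when undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(counts):
    # Macro-average: mean of per-category F-measures,
    # so each category is weighted equally.
    return sum(f1(*c) for c in counts) / len(counts)

def micro_f1(counts):
    # Micro-average: F-measure over the pooled counts,
    # so each individual decision is weighted equally.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)
```

Macro-averaging emphasizes performance on rare categories, while micro-averaging is dominated by the frequent ones, which is why the paper reports both.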
We computed the Fisher kernels with different numbers (10, 20 and 30) of latent classes and added them together to make a robust kernel (Hofmann, 2000).</Paragraph> <Paragraph position="4"> After learning in the original feature space, the parameters of the probability distributions are estimated by maximum likelihood estimation as in Equations (19) and (20), followed by learning with the proposed kernel.</Paragraph> <Paragraph position="5"> We used an SVM package, TinySVM3, for the SVM computation. The soft-margin parameter C was set to 1.0 (other values of C showed no significant changes in the results). The results are shown in Figure 1 (macro-average) and Figure 2 (micro-average). The HP-TOP kernel outperforms the linear kernel and the PLSI-based Fisher kernel for every number of examples.</Paragraph> <Paragraph position="6"> At each number of examples, we conducted a Wilcoxon Signed Rank test at the 5% significance level for the HP-TOP kernel and the linear kernel, since these two performed better than the Fisher kernel. The test shows that the difference between the two methods is significant for training data sizes from 1000 to 5000. The superiority of the HP-TOP kernel for small training datasets supports our expectation that enriching the feature set leads to better performance when few active words are available. Although we also expected that the effect of word sense disambiguation would improve accuracy for large training datasets, the experiments do not provide empirical evidence for this expectation. One possible reason is that Gaussian-type functions do not reflect the actual distribution of the data. We leave further investigation for future research.</Paragraph> <Paragraph position="7"> In this experimental setting, the PLSI-based Fisher kernel did not work well in terms of categorization accuracy. 
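The kernel combination described above (adding the Fisher kernels computed with 10, 20 and 30 latent classes) amounts to an element-wise sum of Gram matrices. A minimal sketch, assuming the per-setting Gram matrices have already been precomputed as nested lists; PLSI fitting itself is out of scope here:

```python
def combine_kernels(gram_matrices):
    # Element-wise sum of precomputed Gram matrices. A sum of valid
    # kernels is itself a valid kernel, which makes the combined
    # kernel robust to the choice of the number of latent classes.
    n = len(gram_matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for K in gram_matrices:
        for i in range(n):
            for j in range(n):
                combined[i][j] += K[i][j]
    return combined
```

The combined matrix can then be passed to any kernel SVM trainer that accepts a precomputed Gram matrix.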
However, this Fisher kernel should perform better when the number of labeled examples is small and many unlabeled examples are available, as reported by Hofmann (2000).</Paragraph> <Paragraph position="8"> We also measured the computational time of each method (Figure 3). The vertical axis indicates the average computational time over 100 runs of experiments (10 runs for each category). Note that the training time in this figure does not include the computational time required for feature extraction4. This result empirically shows that the HP-TOP kernel outperforms the PLSI-based Fisher kernel in terms of computational time, as theoretically expected in Section 5.3.</Paragraph> </Section></Paper>