<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1130">
<Title>Fine Grained Classification of Named Entities</Title>
<Section position="6" start_page="3" end_page="3" type="evalu">
<SectionTitle> 5. Results </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.1 Experiment 1: Held out data </SectionTitle>
<Paragraph position="0"> The results of the classifier on both the validation set and the held-out test set are shown in Figure 4. Results are presented for a classifier trained with the C4.5 algorithm, both with and without MemRun (THRESH1=85, THRESH2=98). Also shown is the baseline score for each test set, computed by always choosing the most frequent classification (Politician for both). The figure makes clear that the classifiers for both test sets and both conditions perform better than the baseline, and that the MemRun algorithm significantly improves performance on both the validation and held-out test sets.</Paragraph>
<Paragraph position="1"> Figure 4 also shows a large discrepancy between the classifier's performance on the two data sets. As expected, the validation set is classified more easily, both with and without MemRun. The size of the discrepancy reflects how far the distribution of the training set departs from the true distribution of person instances in the world. While this discrepancy is undeniable, it is interesting to note how well the classifier generalizes given the very biased sample on which it was trained.</Paragraph>
</Section>
<Section position="2" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.2 Experiment 2: Learning Algorithms </SectionTitle>
<Paragraph position="0"> [Figure 5 caption fragment: learners compared on a validation set include k-Nearest Neighbors, Naive Bayes, support vector machine, neural network, and C4.5 decision tree.]</Paragraph>
<Paragraph position="1"> Figure 5 shows the results of comparing different machine learning strategies. All of the algorithms perform better than the baseline score, and the C4.5 algorithm performs best. This is not surprising, as decision trees combine non-linear separation with feature selection.</Paragraph>
<Paragraph position="2"> Interestingly, however, there is no clear relationship between performance and the theoretical foundations of the classifier.</Paragraph>
<Paragraph position="3"> Although the two top performers (the decision tree and the neural network) are both non-linear classifiers, the linear SVM outperforms the non-linear k-Nearest Neighbors. This result must be taken with a grain of salt, however, as little was done to optimize either the k-NN or the SVM implementation.</Paragraph>
<Paragraph position="4"> Another interesting finding is an apparent relationship between classifier type and performance on held-out data. While the non-parametric learners (C4.5 and k-NN) are fairly robust under generalization, the parametric learners (Naive Bayes and SVM) perform significantly worse on the new distribution. In future work, we intend to examine this possible relationship further.</Paragraph>
</Section>
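The learner comparison above is reported only as accuracy scores in Figure 5. As a rough illustration of the protocol it describes (train each learner, score it on a common validation split, and compare against a most-frequent-class baseline), the following is a minimal sketch using scikit-learn stand-ins: a CART decision tree in place of C4.5, an MLP for the neural network, and LinearSVC for the linear SVM. Every class choice and parameter value here is an assumption made for illustration, not the authors' implementation.

# Illustrative sketch only -- not the authors' implementation.
# Assumes feature vectors X and fine-grained person labels y are already built.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier  # CART, standing in for C4.5

def compare_learners(X, y, seed=0):
    """Train each learner on a fixed split and report validation accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    learners = {
        "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
        "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
        "Naive Bayes": GaussianNB(),
        "linear SVM": LinearSVC(),
        "neural network": MLPClassifier(max_iter=500),
        "decision tree": DecisionTreeClassifier(),
    }
    scores = {}
    for name, clf in learners.items():
        clf.fit(X_train, y_train)
        scores[name] = clf.score(X_val, y_val)  # accuracy on the validation set
    return scores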
<Section position="3" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.3 Experiment 3: Feature sets </SectionTitle>
<Paragraph position="0"> The results of the feature set experiment are shown in Figure 6. Results are shown for the validation set using all combinations of the three feature sets. A baseline measure of always classifying the most frequent category (Politician) is also displayed.</Paragraph>
<Paragraph position="1"> Each of the single feature sets (frequency features, topic signature features, and WordNet features) is sufficient to outperform the baseline. Interestingly, the topic signature features outperform the WordNet features, even though they are similar in form. This suggests that the WordNet features are noisy and may be too general. It may be more appropriate to use a cutoff, such that only the concepts up to two levels above the term are examined. Another source of noise comes from words with multiple senses: although our method uses only word senses of the appropriate part of speech, WordNet still often provides many possible senses.</Paragraph>
<Paragraph position="2"> [Figure 6 caption fragment: results for combinations of feature sets, shown on the validation set using the C4.5 classifier without MemRun.]</Paragraph>
<Paragraph position="3"> Also of interest is the effect of combining any two feature sets. While using topic signatures together with either word frequencies or WordNet features improves performance by a small amount, combining frequency and WordNet scores results in performance worse than WordNet features alone. This suggests overfitting of the training data and may be due to the noise in the WordNet features.</Paragraph>
<Paragraph position="4"> It is clear, however, that the combination of all three feature sets provides a considerable improvement in performance over any of the individual feature sets. In future work we will examine how ensemble learning (Hastie, 2001) might be used to capitalize further on these qualitatively different feature sets.</Paragraph>
</Section>
</Section>
</Paper>
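The two-level WordNet cutoff suggested in Section 5.3 is not specified further in the paper. A minimal sketch of one way such a cutoff could be realized, assuming NLTK's WordNet interface (the function name, the depth limit, and the example lemma are illustrative assumptions), is:

# Illustrative sketch only -- the paper suggests a cutoff but does not give code.
# Assumes NLTK and its WordNet corpus are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def hypernym_features(lemma, pos=wn.NOUN, max_levels=2):
    """Collect hypernym concepts at most max_levels above the given lemma,
    restricted to senses of the requested part of speech."""
    concepts = set()
    for synset in wn.synsets(lemma, pos=pos):
        frontier = [synset]
        for _ in range(max_levels):
            frontier = [h for s in frontier for h in s.hypernyms()]
            concepts.update(s.name() for s in frontier)
    return concepts

# Example: hypernym_features("senator") keeps nearby concepts such as
# legislator.n.01 while excluding very general ones such as entity.n.01.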