A Plethora of Methods for Learning English Countability

5 Evaluation

Evaluation of the supervised classifiers was carried out by 10-fold stratified cross-validation over the relevant dataset, and the results presented here are averaged over the 10 iterations. Classifier performance is rated according to classification accuracy (the proportion of instances classified correctly) and F-score (β = 1). In the case of the SINGLE classifier, the class-wise F-score is calculated by decomposing the multiclass labels into their components. A countable+uncountable instance misclassified as countable, for example, counts as a misclassification in terms of classification accuracy, as a correct classification in the calculation of the countable F-score, and as a misclassification in the calculation of the uncountable F-score. Note that the SINGLE classifier is run over a different dataset from each member of the SUITE classifier, so cross-comparison of classification accuracies is not representative of relative system performance (classification accuracies for the SINGLE classifier are given in parentheses to reinforce this point). Classification accuracies are thus used only to compare classifiers within a single architecture (SINGLE or SUITE), and F-score is the evaluation metric of choice for overall evaluation.

We present the results for two baseline systems for each countability class: a majority-class classifier and the unsupervised method. The majority-class system is run over the binary data used by the SUITE classifier for the given class, and simply classifies all instances according to the most commonly-attested class in that dataset. Irrespective of the majority class, we calculate the F-score based on a positive-class classifier, i.e. a classifier which naively classifies each instance as belonging to the given class; where the positive class is not the majority class, the F-score is given in parentheses.
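To make the evaluation metrics concrete, the following Python sketch (illustrative only; all names are hypothetical, not from the paper) decomposes multiclass labels such as countable+uncountable into per-class binary judgements, computes the class-wise F-score (β = 1), and implements the positive-class baseline that labels every instance as a member of the given class:

```python
# Illustrative sketch: class-wise F-score over decomposed multiclass
# labels, plus the positive-class baseline. Names are hypothetical.

CLASSES = ["countable", "uncountable", "bipartite", "plural_only"]

def f_score(gold, pred, cls):
    """F-score (beta = 1) for membership in `cls`, where gold and pred
    are lists of label sets, e.g. {"countable", "uncountable"}."""
    tp = sum(1 for g, p in zip(gold, pred) if cls in g and cls in p)
    fp = sum(1 for g, p in zip(gold, pred) if cls not in g and cls in p)
    fn = sum(1 for g, p in zip(gold, pred) if cls in g and cls not in p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A countable+uncountable instance misclassified as countable:
gold = [{"countable", "uncountable"}]
pred = [{"countable"}]
assert f_score(gold, pred, "countable") == 1.0    # correct for countable
assert f_score(gold, pred, "uncountable") == 0.0  # a miss for uncountable

def positive_class_baseline(gold, cls):
    """Baseline that naively labels every instance as belonging to `cls`."""
    pred = [set(CLASSES) for _ in gold]  # everything predicted positive
    return f_score(gold, pred, cls)
```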
The results for the different system configurations over the four countability classes are presented in Tables 1-4, in which the highest classification accuracy and F-score values for each class are given in boldface. The classifier Dist(AllCON,SUITE), for example, applies the distribution-based feature representation in a SUITE classifier configuration (i.e. it tests for binary membership in each countability class), using the concatenated feature vectors from each of the tagger, chunker and RASP.

Items of note in the results are:

* the best of the distribution-based classifiers was, without exception, superior to the best of the agreement-based classifiers

* chunk-based feature extraction generally produced superior performance to POS tag-based feature extraction, which was in turn generally better than RASP-based feature extraction; statistically significant differences in F-score (based on the two-tailed t-test, p < .05) were observed for both chunking and tagging over RASP for the plural only class, and for chunking over RASP for the countable class

* for the SUITE classifier, system combination by either concatenation (Dist(AllCON,SUITE)) or averaging over the individual feature values (Dist(AllMEAN,SUITE)) generally led to a statistically significant improvement over each of the individual systems for the countable and uncountable classes, but there was no statistical difference between these two architectures for any of the four countability classes; for the SINGLE classifier, system combination (Dist(AllCON,SINGLE)) did not lead to a significant performance gain (the two combination strategies are sketched below this list)
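As a rough illustration of the two combination strategies (a sketch under assumed dimensions, not the paper's implementation): each preprocessor (tagger, chunker, RASP) contributes one distribution-based feature vector per noun, of 1,284 values each, which are either concatenated (AllCON) or averaged position-wise (AllMEAN):

```python
import numpy as np

def combine_con(vectors):
    """AllCON-style combination: concatenate the per-system feature
    vectors (e.g. 3 x 1,284 values -> 3,852 values)."""
    return np.concatenate(vectors)

def combine_mean(vectors):
    """AllMEAN-style combination: position-wise mean of the per-system
    vectors, which assumes they share one feature space (1,284 values)."""
    return np.mean(np.stack(vectors), axis=0)

# e.g. feature vectors for one noun from the tagger, chunker and RASP:
tagger, chunker, rasp = (np.random.rand(1284) for _ in range(3))
assert combine_con([tagger, chunker, rasp]).shape == (3852,)
assert combine_mean([tagger, chunker, rasp]).shape == (1284,)
```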
To evaluate the effects of feature selection, we graphed the F-score and processing time (in instances processed per second) over values of N from 25 up to the full feature set. We targeted the Dist(AllCON,SUITE) system for evaluation (3,852 features), and ran it over both the countable and uncountable classes. We additionally carried out random feature selection as a baseline against which to compare the feature-selection results. Note that the x-axis (N) and the right y-axis (instances/sec) are both logarithmic, such that the linear, right-decreasing time curves indicate a direct proportionality between the number of features and processing time.

The differential in F-score between the best-N configuration and the full feature set is statistically insignificant for N > 100 for countable nouns and N > 50 for uncountable nouns. That is, feature selection facilitates a relative speed-up of around 30x without a significant drop in F-score. Comparing the results for the best-N and rand-N features, the difference in F-score was statistically significant for all values of N < 1000. The proposed method of feature selection thus allows us to maintain the full classification potential of the feature set while enabling a speed-up of more than an order of magnitude, potentially making the difference in practical utility for the proposed method.

To determine the relative impact of the component feature values on the performance of the distribution-based feature representation, we used the Dist(AllMEAN,SUITE) configuration to build: (a) a classifier using a single binary value for each unit feature, based on simple corpus occurrence (Binary); and (b) three separate classifiers based on each of the corpfreq, wordfreq and featfreq feature values only (without the 2D feature cluster totals). In each case, the total number of feature values is 206.

The results for each of these classifiers over countable and uncountable nouns are presented in Table 5, alongside those for the basic Dist(AllMEAN,SUITE) classifier with all 1,284 features (All features) and for the best-200 features (Best-200). Results which differ from those for All features to a level of statistical significance are asterisked.

The binary classifiers performed significantly worse than All features for both countable and uncountable nouns, underlining the utility of the distribution-based feature representation. wordfreq was marginally superior to corpfreq as a standalone feature representation, and both were on the whole slightly below the full feature set in performance (although no significant difference was observed). featfreq performed worse still, significantly below the level of the full feature set. Results for the Best-200 classifier were marginally higher than those for each of the individual feature representations over the countable class, but marginally below the results for corpfreq and wordfreq over the uncountable class. The differences here are not statistically significant, and additional evaluation is required to determine the relative merits of feature selection over simply using the wordfreq values, for example.
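As a rough reconstruction of the component representations compared in Table 5 (the precise normalisations are defined in the paper's feature-extraction section; the ones below are assumptions for illustration), each unit feature of a target noun can be rendered as a binary occurrence flag or as one of three relative frequencies:

```python
def component_values(freq_wf, freq_w, freq_cluster, corpus_size):
    """Candidate standalone values for one unit feature f of a noun w.
    freq_wf: corpus frequency of w occurring with f; freq_w: corpus
    frequency of w; freq_cluster: total frequency of w over the feature
    cluster containing f; corpus_size: total corpus frequency. All
    normalisations here are assumptions, not the paper's definitions."""
    return {
        "Binary":   1 if freq_wf > 0 else 0,  # simple corpus occurrence
        "corpfreq": freq_wf / corpus_size,    # relative to the corpus
        "wordfreq": freq_wf / freq_w,         # relative to the target noun
        "featfreq": freq_wf / freq_cluster,   # relative to the feature cluster
    }

# e.g. a noun seen 1,000 times, 40 of them with feature f, whose cluster
# accounts for 200 of the noun's occurrences, in a 10M-token corpus:
print(component_values(40, 1000, 200, 10_000_000))
```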
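Finally, an illustrative stand-in for the best-N/rand-N comparison described above (the paper's actual feature-scoring method is not reproduced here; the scorer, the paired form of the two-tailed t-test, and all names are assumptions):

```python
import random
from statistics import mean
from scipy.stats import ttest_rel  # two-tailed paired t-test

def best_n(scores, n):
    """Indices of the n highest-scoring features, given per-feature
    relevance scores from some scoring method (assumed)."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n]

def rand_n(num_features, n, seed=0):
    """rand-N baseline: n feature indices drawn uniformly at random."""
    return random.Random(seed).sample(range(num_features), n)

def significantly_different(f_best, f_rand, alpha=0.05):
    """Compare per-fold F-scores (10 values each, one per CV fold) with a
    two-tailed paired t-test, reporting significance at p < alpha."""
    _, p = ttest_rel(f_best, f_rand)
    return mean(f_best), mean(f_rand), p < alpha
```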