<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1059"> <Title>Learning the Countability of English Nouns from Corpus Data</Title> <Section position="7" start_page="3" end_page="3" type="evalu"> <SectionTitle> 5 Results and Evaluation </SectionTitle> <Paragraph position="0"> Evaluation is broken down into two components.</Paragraph> <Paragraph position="1"> First, we determine the optimal classifier configuration for each countability class by way of stratified cross-validation over the gold-standard data. We then run each classifier in optimised configuration over the remaining target nouns for which we have feature vectors.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.1 Cross-validated results </SectionTitle> <Paragraph position="0"> First, we ran the classifiers over the full feature set for the three feature extraction methods. In each case, we quantify the classifier performance by way of 10-fold stratified cross-validation over the gold-standard data for each countability class. The final classification accuracy and F-score3 are averaged over the 10 iterations.</Paragraph> <Paragraph position="1"> The cross-validated results for each classifier are presented in Table 3, broken down into the different feature extraction methods. For each, in addition to the F-score and classification accuracy, we present the relative error reduction (e.r.) in classification accuracy over the majority-class baseline for that gold-standard set (see Table 2). For each countability class, we additionally ran the classifier over the concatenated feature vectors for the three basic feature extraction methods, producing a 3,852-value feature space (&quot;Combined&quot;).</Paragraph> <Paragraph position="2"> Given the high baseline classification accuracies for each gold-standard dataset, the most revealing statistics in Table 3 are the error reduction and F-score values. In all cases other than bipartite, the combined system outperformed the individual systems. The difference in F-score is statistically significant (based on the two-tailed t-test, p < :05) for the asterisked systems in Table 3. For the bipartite class, the difference in F-score is not statistically significant between any system pairing.</Paragraph> <Paragraph position="3"> There is surprisingly little separating the tagger-, chunker- and RASP-based feature extraction methods. This is largely due to the precision/recall trade-off noted above for the different systems.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 5.2 Open data results </SectionTitle> <Paragraph position="0"> We next turn to the task of classifying all unseen common nouns using the gold-standard data and the best-performing classifier configurations for each Here, the baseline method is to classify every noun as being uniquely countable.</Paragraph> <Paragraph position="1"> There were 11,499 feature-mapped common nouns not contained in the union of the gold-standard datasets. Of these, the classifiers were able to classify 10,355 (90.0%): 7,974 (77.0%) as countable (e.g. alchemist), 2,588 (25.0%) as uncountable (e.g. ingenuity), 9 (0.1%) as bipartite (e.g. headphones), and 80 (0.8%) as plural only (e.g. damages). Only 139 nouns were assigned to multiple countability classes.</Paragraph> <Paragraph position="2"> We evaluated the classifier outputs in two ways. 
<Section position="2" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 5.2 Open data results </SectionTitle>
<Paragraph position="0"> We next turn to the task of classifying all unseen common nouns using the gold-standard data and the best-performing classifier configurations for each countability class.4 Here, the baseline method is to classify every noun as being uniquely countable.</Paragraph>
<Paragraph position="1"> 4 In each case, the classifier is run over the best-500 features as selected by the method described in Baldwin and Bond (2003) rather than the full feature set, purely in the interests of reducing processing time. Based on cross-validated results over the training data, the resultant difference in performance is not statistically significant.</Paragraph>
<Paragraph position="2"> There were 11,499 feature-mapped common nouns not contained in the union of the gold-standard datasets. Of these, the classifiers were able to classify 10,355 (90.0%): 7,974 (77.0%) as countable (e.g. alchemist), 2,588 (25.0%) as uncountable (e.g. ingenuity), 9 (0.1%) as bipartite (e.g. headphones), and 80 (0.8%) as plural only (e.g. damages). Only 139 nouns were assigned to multiple countability classes.</Paragraph>
<Paragraph position="3"> We evaluated the classifier outputs in two ways. In the first, we compared the classifier output to the combined COMLEX and ALT-J/E lexicons: a lexicon with countability information for 63,581 nouns. The classifiers found a match for 4,982 of the nouns. The predicted countability was judged correct 94.6% of the time. This is marginally above the level of match between ALT-J/E and COMLEX (93.8%) and substantially above the baseline of all-countable at 89.7% (error reduction = 47.6%).</Paragraph>
<Paragraph position="4"> To gain a better understanding of the classifier performance, we analysed the correlation between the corpus frequency of a given target noun and its precision/recall for the countable class.5 To do this, we listed the 11,499 unannotated nouns in increasing order of corpus occurrence, and worked through the ranking calculating the mean precision and recall over each partition of 500 nouns. This resulted in the precision-recall graph given in Figure 1, from which it is evident that mean recall is proportional and precision inversely proportional to corpus frequency. That is, for lower-frequency nouns, the classifier tends to rampantly classify nouns as countable, while for higher-frequency nouns, the classifier tends to be extremely conservative in positively classifying nouns. One possible explanation for this is that, based on the training data, the frequency of a noun is proportional to the number of countability classes it belongs to. Thus, for the more frequent nouns, evidence for alternate countability classes can cloud the judgement of a given classifier.</Paragraph>
<Paragraph position="5"> 5 We similarly analysed the uncountable class and found the same basic trend.</Paragraph>
<Paragraph position="6"> In a secondary evaluation, the authors used BNC corpus evidence to blind-annotate 100 randomly-selected nouns from the test data, and tested the correlation with the system output. This is intended to test the ability of the system to capture corpus-attested usages of nouns, rather than the independent lexicographic intuitions described in the COMLEX and ALT-J/E lexicons. Of the 100, 28 were classified by the annotators into two or more groups (mainly countable and uncountable). On this set, the baseline of all-countable was 87.8%, and the classifiers gave an agreement of 92.4% (37.7% e.r.); agreement with the dictionaries was also 92.4%.</Paragraph>
<Paragraph position="7"> Again, the main source of errors was the classifier only returning a single countability for each noun. To put this figure in proper perspective, we also hand-annotated 100 randomly-selected nouns from the training data (that is, words in our combined lexicon) according to BNC corpus evidence.</Paragraph>
<Paragraph position="8"> Here, we tested the correlation between the manual judgements and the combined ALT-J/E and COMLEX dictionaries. For this dataset, the baseline of all-countable was 80.5%, and agreement with the dictionaries was a modest 86.8% (32.3% e.r.).</Paragraph>
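The frequency-based analysis described earlier in this subsection (ranking the 11,499 unannotated nouns by increasing corpus frequency and computing mean precision and recall for the countable class over successive partitions of 500 nouns, as plotted in Figure 1) could be sketched as follows. The triples of (corpus frequency, predicted label, reference label) are hypothetical inputs; the paper does not give the actual binning code.

```python
# Illustrative sketch (not the paper's code) of the precision/recall-by-
# frequency analysis for the countable class: rank nouns by corpus
# frequency and compute precision/recall within each partition of 500.
def binned_precision_recall(nouns, bin_size=500):
    """nouns: iterable of (corpus_frequency, predicted, reference) triples,
    where predicted/reference are booleans for the countable class."""
    ranked = sorted(nouns, key=lambda n: n[0])      # increasing frequency
    bins = []
    for start in range(0, len(ranked), bin_size):
        part = ranked[start:start + bin_size]
        tp = sum(1 for _, pred, ref in part if pred and ref)
        fp = sum(1 for _, pred, ref in part if pred and not ref)
        fn = sum(1 for _, pred, ref in part if not pred and ref)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        bins.append((start, precision, recall))
    # Per Figure 1, recall rises and precision falls with corpus frequency.
    return bins
```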
<Paragraph position="9"> Based on this limited evaluation, therefore, our automated method is able to capture corpus-attested countabilities with greater precision than a manually-generated static repository of countability data.</Paragraph>
</Section>
</Section>
</Paper>