<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1010"> <Title>A Plethora of Methods for Learning English Countability</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Feature representation </SectionTitle> <Paragraph position="0"> We test two basic feature representations in this research: distribution-based, which simply looks at the relative occurrence of different features in the corpus data, and agreement-based, which analyses the level of token-wise agreement between multiple systems.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Distribution-based feature representation </SectionTitle> <Paragraph position="0"> In the distribution-based feature representation, we take each target noun in turn and compare its amalgamated value for each unit feature with (a) the values for other target nouns, and (b) the value of other unit features within that same feature cluster. That is, we focus on the relative prominence of features globally within the corpus and locally within each feature cluster.</Paragraph> <Paragraph position="1"> In the case of a one-dimensional feature cluster (e.g. singular determiners), each unit feature f_s for target noun w is translated into 3 separate feature values:</Paragraph> <Paragraph position="3"> where freq(*) is the frequency of all words in the corpus. That is, for each unit feature we capture the relative corpus frequency, frequency relative to the target word frequency, and frequency relative to other features in the same feature cluster. Thus, for an n-valued one-dimensional feature cluster, we generate 3n independent feature values.</Paragraph> <Paragraph position="4"> In the case of a two-dimensional feature matrix (e.g. subject-position noun number vs. 
verb number agreement), each unit feature f_s,t for target noun w is translated into corpfreq(f_s,t,w), wordfreq(f_s,t,w) and featfreq(f_s,t,w) as above, and 2 additional feature values:</Paragraph> <Paragraph position="6"> which represent the featfreq values calculated along each of the two feature dimensions. Additionally, we calculate cumulative totals for each row and column of the feature matrix and describe each as for the one-dimensional features above (in the form of 3 values). Thus, for an m x n-valued two-dimensional feature cluster, we generate a total of 5mn+3(m+n) independent feature values.</Paragraph> <Paragraph position="7"> The feature clusters produce a combined total of 1284 individual feature values.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Agreement-based feature representation </SectionTitle> <Paragraph position="0"> The agreement-based feature representation considers the degree of token agreement between the features extracted using the three different pre-processors. This allows us to pinpoint the reliable diagnostics within the corpus data and filter out noise generated by the individual pre-processors.</Paragraph> <Paragraph position="1"> It is possible to identify the features which are positively correlated with a unique countability class (e.g. occurrence of a singular noun with the determiner "a" occurs only for countable nouns), and for each to determine the token-level agreement between the different systems. 
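As an illustration of this token-level agreement idea, a minimal sketch might compute pairwise overlap between the token sets at which each pre-processor detects a given diagnostic (the system names and token IDs here are invented for the sketch; the paper's own extraction code is not shown):

```python
from itertools import combinations

def pairwise_agreement(token_sets):
    """Token-level agreement between pre-processors for one diagnostic.

    `token_sets` maps a system name to the set of corpus-token IDs at
    which that system detected the diagnostic for a given noun; the
    overlap ratio is high only where the systems reliably agree."""
    scores = {}
    for (s1, t1), (s2, t2) in combinations(sorted(token_sets.items()), 2):
        union = t1 | t2
        # Intersection over union of the two systems' token sets.
        scores[(s1, s2)] = len(t1 & t2) / len(union) if union else 0.0
    return scores
```

A diagnostic detected at mostly the same token positions by all three systems is a reliable cue; one detected at disjoint positions is likely pre-processor noise.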
The number of diagnostics considered for each of the countability classes is: 32 for countable nouns, 19 for uncountable nouns and 1 for each of plural only and bipartite nouns.</Paragraph> <Paragraph position="2"> The total number of diagnostics we test agreement across is thus 53.</Paragraph> <Paragraph position="3"> The token-level correlation for each feature f s is calculated fourfold according to relative agreement, the k statistic, correlated frequency and correlated weight. The relative agreement between systems sys and sys wrt f s for target noun w is defined to be:</Paragraph> <Paragraph position="5"> where tok(fs,w)(sysi) returns the set of token instances of (f s,w). The k statistic (Carletta, 1996) is recast as:</Paragraph> <Paragraph position="7"> In this modified form, k(fs,w) represents the divergence in relative agreement wrt f s for target noun w, relative to the mean relative agreement wrt f s over all words. Correlated frequency is defined to be:</Paragraph> <Paragraph position="9"> It describes the occurrence of tokens in agreement for (f s,w) relative to the total occurrence of the target word.</Paragraph> <Paragraph position="10"> The metrics are used to derive three separate feature values for each diagnostic over the three pre-processor system pairings. We additionally calculate the mean value of each metric across the system pairings and the overall correlated weight for each countability class C as:</Paragraph> <Paragraph position="12"> Correlated weight describes the occurrence of correlated features in the given countability class relative to other correlated features.</Paragraph> <Paragraph position="13"> We test agreement: (a) for each of these diagnostics individually and within each countability class (Agree(Token,[?])), and (b) across the amalgam of diagnostics for each of the countability classes (Agree(Class,[?])). 
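The equations for these four metrics did not survive extraction; a hedged reconstruction from the surrounding prose (reading relative agreement as token-set overlap, and writing \overline{agr}_{f_s} for the mean relative agreement wrt f_s over all words) is:

```latex
% Hedged reconstruction -- the original equations are missing from this copy.
\mathrm{agr}_{(f_s,w)}(sys_i,sys_j) =
  \frac{|tok_{(f_s,w)}(sys_i) \cap tok_{(f_s,w)}(sys_j)|}
       {|tok_{(f_s,w)}(sys_i) \cup tok_{(f_s,w)}(sys_j)|}

\kappa_{(f_s,w)} =
  \frac{\mathrm{agr}_{(f_s,w)} - \overline{\mathrm{agr}}_{f_s}}
       {1 - \overline{\mathrm{agr}}_{f_s}}

\mathrm{cfreq}_{(f_s,w)} =
  \frac{|tok_{(f_s,w)}(sys_i) \cap tok_{(f_s,w)}(sys_j)|}{freq(w)}

\mathrm{cw}(C,w) =
  \frac{\sum_{f_s \in C} \mathrm{cfreq}_{(f_s,w)}}
       {\sum_{f_s} \mathrm{cfreq}_{(f_s,w)}}
```

These forms are consistent with the prose glosses (kappa as divergence from mean agreement; cfreq normalised by target-word frequency; cw as the class-relative share of correlated features), but the exact originals may differ in detail.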
For Agree(Token,*), we calculate agr, kappa and cfreq values for each of the 53 diagnostics across the 3 system pairings, and additionally calculate the mean value of each metric across the pairings. We additionally calculate the overall cw value for each countability class. This results in a total of 640 feature values (3x53x3 + 53x3 + 4). In the case of Agree(Class,*), we average the agr, kappa and cfreq values across each countability class for each of the three system pairings, and also calculate the mean value in each case. We further calculate the overall cw value for each countability class, culminating in a total of 52 feature values (3x4x3 + 4x3 + 4).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Classifier Set-up and Evaluation </SectionTitle> <Paragraph position="0"> Below, we outline the different classifiers tested and describe the process used to generate the gold-standard data.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Classifier architectures </SectionTitle> <Paragraph position="0"> We propose a variety of unsupervised and supervised classifier architectures for the task of learning countability, and also a feature selection method. In all cases, our classifiers are built using TiMBL version 4.2 (Daelemans et al., 2002), a memory-based classification system based on the k-nearest neighbour algorithm. 
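TiMBL itself is an external tool; purely as an illustration of the memory-based (k-nearest neighbour) classification it performs, a toy IB1-style classifier with weighted-overlap distance and majority voting could look like the following (the feature weights and training examples are invented for the sketch, not taken from the paper):

```python
from collections import Counter

def knn_classify(train, weights, query, k=9):
    """Memory-based classification in the spirit of TiMBL's IB1:
    weighted overlap distance plus majority vote over the k nearest
    neighbours. `train` is a list of (feature_vector, label) pairs;
    `weights` holds one relevance weight (e.g. gain ratio) per feature."""
    def distance(a, b):
        # Weighted overlap: each mismatching feature contributes its weight.
        return sum(w for w, x, y in zip(weights, a, b) if x != y)
    neighbours = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

The real system additionally uses gain-ratio feature weighting learned from the data, which this sketch takes as given.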
As a result of extensive parameter optimisation, we settled on the default configuration for TiMBL (IB1 with weighted overlap, gain ratio-based feature weighting and equal weighting of neighbours) with k set to 9.</Paragraph> <Paragraph position="1"> We additionally experimented with the kernel-based TinySVM system, but found TiMBL to be the marginally superior performer in all cases, a somewhat surprising result given the high dimensionality of the feature space.</Paragraph> <Paragraph position="2"> Full-feature supervised classifiers The simplest system architecture applies the supervised learning paradigm to the distribution-based feature vectors for each of the POS tagger, chunker and RASP (Dist(POS,*), Dist(chunk,*) and Dist(RASP,*), respectively). For the distribution-based feature representation, we additionally combine the outputs of the three pre-processors by: (a) concatenating the individual distribution-based feature vectors for the three systems (resulting in a 3852-element feature vector: Dist(AllCON,*)); and (b) taking the mean over the three systems for each distribution-based feature value (resulting in a 1284-element feature vector: Dist(AllMEAN,*)).</Paragraph> <Paragraph position="3"> The agreement-based feature representation provides two additional system configurations: Agree(Class,*) and Agree(Token,*) (see Section 3.2).</Paragraph> <Paragraph position="4"> Orthogonal to the issue of how to generate the feature values is the question of how to classify a given noun according to the different countability classes. The two basic options here are to either have a single classifier and define multi-classes according to all observed combinations of countability classes (Dist(*,SINGLE)), or have a suite of binary classifiers, one for each countability class (Dist(*,SUITE)). 
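The two architectures can be contrasted in a small sketch (the helper names and dummy classifiers are illustrative, not from the paper): SINGLE folds each observed combination of countability classes into one multi-class label, while SUITE runs one binary classifier per class:

```python
def single_label(countabilities):
    """SINGLE architecture: one classifier over multi-class labels built
    from each observed combination of countability classes."""
    return "+".join(sorted(countabilities))

def suite_classify(binary_classifiers, features):
    """SUITE architecture: one binary classifier per countability class;
    a noun may receive several classes, or fall through with none."""
    return {cls for cls, clf in binary_classifiers.items() if clf(features)}
```

The sketch makes the trade-off concrete: `single_label` must see every class combination at training time (hence the sparseness problem), while `suite_classify` can return the empty set (a noun assigned to no class at all).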
The SINGLE classifier architecture has advantages in terms of speed (a 4x speed-up over the classifier suite) and simplicity, but runs into problems with data sparseness for the less commonly attested multi-classes, given that a single noun can occur with multiple countabilities. The SUITE classifier architecture delineates the different countability classes more directly, but runs the risk of a noun not being classified according to any of the four classes.</Paragraph> <Paragraph position="5"> Feature-selecting supervised classifiers We improve the performance of the basic classifiers by way of best-N filter-based feature selection. Feature selection has been shown to improve classification accuracy over a variety of tasks (Liu and Motoda, 1988), but in the case of memory-based learners such as TiMBL, has the additional advantage of accelerating the classification process and reducing memory overhead. The computational complexity of memory-based learners is proportional to the number of features, so any reduction in the feature space leads to a proportionate reduction in computational time. For tasks such as countability classification with a large number of both feature values and test instances (particularly if we are to classify all noun types in a given corpus), this speed-up is vital.</Paragraph> <Paragraph position="6"> Our feature selection method uses a combined feature relevance metric to estimate the best-N features for each countability class, and then restricts the classifier to operate over only those N features.</Paragraph> <Paragraph position="7"> Feature relevance is estimated through analysis of the correspondence between class and feature values for a given feature, using metrics including shared variance and information gain. These individual metrics tend to be biased toward particular features: information gain and gain ratio, e.g., tend to favour features of higher cardinality (White and Liu, 1994). 
In order to minimise such bias, we generate a feature ranking for each feature selection metric (based on the relative feature relevance scores), and simply add the absolute ranks for each feature together. By re-ranking the features in increasing order of summed rank, we can generate a generalised feature relevance ranking. We are now in a position to prune the feature space to a pre-determined size, by taking the best-N features in the feature ranking.</Paragraph> <Paragraph position="8"> The feature selection metrics we combine are those implemented in TiMBL, namely: shared variance, chi-square, information gain and gain ratio.</Paragraph> <Paragraph position="9"> Unsupervised classifier In order to derive a common baseline for the different systems, we built an unsupervised classifier which, for each target noun, simply checks to see if any diagnostic (as used in the agreement-based feature representation) was detected for each of the countability classes; even a single occurrence of a diagnostic is taken to be sufficient evidence for membership in that countability class. Elementary system combination is achieved by voting between the three pre-processor outputs as to whether the target noun belongs to a given countability class. That is, the target noun is classified as belonging to a given countability class iff at least two of the pre-processors furnish linguistic evidence for membership in that class.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Training data </SectionTitle> <Paragraph position="0"> Training data was generated independently for the SINGLE and SUITE classifiers. In each case, we first extracted all countability-annotated nouns from each of the ALT-J/E and COMLEX lexicons which are attested at least 10 times in the BNC, and composed the training data from these pre-filtered sets. 
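The rank-combination feature selection described above can be sketched as follows (the metric names and scores are invented for the sketch; in the paper the relevance statistics come from TiMBL):

```python
def best_n_features(scores_by_metric, n):
    """Combine per-metric relevance scores into a generalised ranking:
    rank the features under each metric (higher score = better = rank 0),
    sum the per-metric ranks, and keep the N features with the smallest
    summed rank."""
    summed = {}
    for scores in scores_by_metric.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, feat in enumerate(ranked):
            summed[feat] = summed.get(feat, 0) + rank
    # Re-rank in increasing order of summed rank and prune to best N.
    return sorted(summed, key=lambda f: summed[f])[:n]
```

Working on ranks rather than raw scores is what neutralises the scale and cardinality biases of the individual metrics: a feature only wins overall by ranking well under several metrics at once.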
In the case of the SINGLE classifier, we simply classified words according to the union of all countabilities from ALT-J/E and COMLEX, resulting in the following dataset: Count Uncount Plural Bipart No. Freq</Paragraph> <Paragraph position="2"> From this, it is evident that some class combinations (e.g. plural only+bipartite) are highly infrequent, hinting at a problem with data sparseness.</Paragraph> <Paragraph position="3"> For the SUITE classifier, we generate the positive exemplars for the countable and uncountable classes from the intersection of the COMLEX and ALT-J/E data for that class; negative exemplars, on the other hand, are those not annotated as belonging to that class in either lexicon. With the plural only and bipartite data, COMLEX cannot be used as it does not describe these two classes. We thus took all members of each class listed in ALT-J/E as our positive exemplars, and all remaining nouns with non-identical singular and plural forms as negative exemplars. This resulted in the following datasets:</Paragraph> </Section> </Section> </Paper>