<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1010">
<Title>A Plethora of Methods for Learning English Countability</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>2 Preliminaries</SectionTitle>
<Paragraph position="0"> In this section, we describe the countability classes, the resources used in this research, and the feature extraction method. These are described in greater detail in Baldwin and Bond (2003).</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.1 Countability classes</SectionTitle>
<Paragraph position="0"> Nouns are classified as belonging to one or more of four possible classes: countable, uncountable, plural only and bipartite. Countable nouns can be modified by denumerators, prototypically numbers, and have a morphologically marked plural form: one dog, two dogs. Uncountable nouns cannot be modified by denumerators, but can be modified by unspecific quantifiers such as much; they do not show any number distinction (prototypically being singular): *one equipment, some equipment, *two equipments. Plural only nouns only have a plural form, such as goods, and cannot be either denumerated or modified by much; many plural only nouns, such as clothes, use the plural form even as modifiers: a clothes horse. Bipartite nouns are plural when they head a noun phrase (trousers), but generally singular when used as a modifier (trouser leg); they can be denumerated with the classifier pair: a pair of scissors.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.2 Gold standard data</SectionTitle>
<Paragraph position="0"> Information about noun countability was obtained from two sources: COMLEX 3.0 (Grishman et al., 1998) and the common noun part of ALT-J/E's Japanese-to-English semantic transfer dictionary (Ikehara et al., 1991). Of the approximately 22,000 noun entries in COMLEX, 13,622 are marked as countable, 710 as uncountable and the remainder are unmarked for countability. ALT-J/E has 56,245 English noun types with distinct countability.</Paragraph>
</Section>
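To make the gold-standard representation concrete, the following is a minimal sketch (not from the paper; all entries, names and the merging policy are illustrative assumptions) of how the multi-label countability assignments from the two lexicons might be combined into a single noun-to-classes mapping:

```python
# Illustrative sketch only: merge two hypothetical lexicons into a
# multi-label gold standard (a noun may belong to more than one class).
COUNTABILITY_CLASSES = {"countable", "uncountable", "plural_only", "bipartite"}

def merge_gold_standard(comlex, altje):
    """Union the class labels for every noun found in either lexicon."""
    gold = {}
    for lexicon in (comlex, altje):
        for noun, classes in lexicon.items():
            assert classes <= COUNTABILITY_CLASSES
            gold.setdefault(noun, set()).update(classes)
    return gold

# Hypothetical toy entries for illustration:
comlex = {"dog": {"countable"}, "equipment": {"uncountable"}}
altje = {"dog": {"countable"}, "scissors": {"bipartite"}, "goods": {"plural_only"}}

gold = merge_gold_standard(comlex, altje)
print(gold["dog"])       # {'countable'}
print(gold["scissors"])  # {'bipartite'}
```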
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.3 Feature space</SectionTitle>
<Paragraph position="0"> Features used in this research are divided into feature clusters, each of which is conditioned on the occurrence of a target noun in a given construction. Feature clusters are either one-dimensional (describing a single multivariate feature) or two-dimensional (describing the interaction between two multivariate features), with each dimension describing a lexical or syntactic property of the construction in question. An example of a one-dimensional feature cluster is head noun number, i.e. the number (singular or plural) of the target noun when it occurs as the head of an NP; an example of a two-dimensional feature cluster is subject-verb agreement, i.e. the number (singular or plural) of the target noun when it occurs as head of a subject NP vs. number agreement on the verb (singular or plural). Below, we provide a basic description of the 10 feature clusters used in this research and their dimensionality ([x]=1-dimensional feature cluster with x unit features, [x×y]=2-dimensional feature cluster with x×y unit features). These represent a total of 206 unit features.</Paragraph>
<Paragraph position="1">
Head noun number:[] the number of the target noun when it heads an NP
Modifier noun number:[] the number of the target noun when it occurs as a modifier in an NP
Subject-verb agreement:[x] the number of the target noun in a subject position vs. number agreement on the governing verb
Coordinate noun number:[x] the number of the target noun vs. the number of the head nouns of conjuncts
N of N constructions:[x] the type of the N1 (e.g. COLLECTIVE, TEMPORAL) vs. the number of the target noun (N2) in an N1 of N2 construction
Occurrence in PPs:[x] the preposition type vs. the presence or absence of a determiner when the target noun occurs in singular form in a PP
Pronoun co-occurrence:[x] what personal, possessive and reflexive pronouns (e.g. he, their, itself) occur in the same sentence as singular and plural instances of the target noun
Singular determiners:[] what singular-selecting determiners (e.g. a, much) occur in NPs headed by the target noun in singular form
Plural determiners:[] what plural-selecting determiners (e.g. many, various) occur in NPs headed by the target noun in plural form
Non-bounded determiners:[x] what non-bounded determiners (e.g. more, sufficient) occur in NPs headed by the target noun, and what is the number of the target noun for each
</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.4 Feature extraction</SectionTitle>
<Paragraph position="0"> The values for the features described above were extracted from the written component of the British National Corpus (BNC, Burnard (2000)) using three different pre-processors: (a) a POS tagger, (b) a full-text chunker and (c) a dependency parser. These are used independently to test the efficacy of the different systems at capturing features used in the classification process, and in tandem to consolidate the strengths of the individual methods.</Paragraph>
<Paragraph position="1"> With the POS extraction method, we first tagged the BNC using an fnTBL-based tagger (Ngai and Florian, 2001) trained over the Brown and WSJ corpora and based on the Penn POS tagset. We then lemmatised this data using a Penn tagset-customised version of morph (Minnen et al., 2001). Finally, we implemented a range of high-precision, low-recall POS-based templates to extract the features from the processed data.</Paragraph>
<Paragraph position="2"> For the chunker, we ran fnTBL over the lemmatised tagged data, training over CoNLL 2000-style (Tjong Kim Sang and Buchholz, 2000) chunk-converted versions of the full Brown and WSJ corpora. For the NP-internal features (e.g. determiners, head number), we used the noun chunks directly, or applied POS-based templates locally within noun chunks. For inter-chunk features (e.g. subject-verb agreement), we looked at only adjacent chunk pairs so as to maintain a high level of precision.</Paragraph>
<Paragraph position="3"> We read dependency tuples directly off the output of RASP (Briscoe and Carroll, 2002b) in grammatical relation mode.1 RASP has the advantage that recall is high, although precision is potentially lower than with chunking or tagging, as the parser is forced into resolving phrase attachment ambiguities and committing to a single phrase structure analysis.</Paragraph>
<Paragraph position="4"> 1 We used the first parse in the experiments reported here. An alternative method would be to use weighted dependency tuples, as described in Briscoe and Carroll (2002a).</Paragraph>
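As an illustration of the kind of high-precision, low-recall POS-based template extraction described above, here is a minimal sketch under assumed inputs (Penn-style tags over lemmatised text). It counts singular vs. plural occurrences of each noun as an NP head, corresponding to the head noun number feature cluster; the template and input format are assumptions for illustration, not the authors' implementation:

```python
# Illustrative sketch: a conservative POS-based template for the
# "head noun number" feature cluster over lemmatised, Penn-tagged text.
from collections import defaultdict

SINGULAR_TAGS = {"NN"}
PLURAL_TAGS = {"NNS"}

def head_noun_number_counts(tagged_sentences):
    """tagged_sentences: list of [(lemma, pos), ...] per sentence.
    A token is treated as an NP head if it is a noun not immediately
    followed by another noun (deliberately high-precision, low-recall)."""
    counts = defaultdict(lambda: {"sg": 0, "pl": 0})
    for sent in tagged_sentences:
        for i, (lemma, pos) in enumerate(sent):
            next_pos = sent[i + 1][1] if i + 1 < len(sent) else None
            is_head = next_pos not in SINGULAR_TAGS | PLURAL_TAGS
            if pos in SINGULAR_TAGS and is_head:
                counts[lemma]["sg"] += 1
            elif pos in PLURAL_TAGS and is_head:
                counts[lemma]["pl"] += 1
    return counts

# Toy input for illustration:
sents = [[("two", "CD"), ("dog", "NNS"), ("bark", "VBP")],
         [("the", "DT"), ("equipment", "NN"), ("arrive", "VBZ")]]
print(dict(head_noun_number_counts(sents)))
# {'dog': {'sg': 0, 'pl': 1}, 'equipment': {'sg': 1, 'pl': 0}}
```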
<Paragraph position="5"> After generating the different feature vectors for each noun based on the above configurations, we filtered out all nouns which did not occur at least 10 times in NP head position in the output of all three systems. This resulted in a total of 20,530 nouns, of which 9,031 are contained in the combined COMLEX and ALT-J/E lexicons. The evaluation is based on these 9,031 nouns.</Paragraph>
</Section>
</Section>
</Paper>
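For concreteness, a minimal sketch (assumptions only, not the authors' code; all variable names and counts are illustrative) of the final filtering step: keeping nouns that occur at least 10 times in NP head position in the output of all three systems, then restricting evaluation to those covered by the combined gold-standard lexicons:

```python
# Illustrative sketch of the frequency filter and lexicon intersection.
MIN_HEAD_FREQ = 10

def filter_nouns(head_counts_by_system, gold_standard):
    """head_counts_by_system: {'tagger': {noun: count}, 'chunker': ..., 'parser': ...}
    gold_standard: {noun: set_of_countability_classes}"""
    shared = set.intersection(*(set(c) for c in head_counts_by_system.values()))
    kept = [n for n in shared
            if all(counts[n] >= MIN_HEAD_FREQ
                   for counts in head_counts_by_system.values())]
    evaluable = [n for n in kept if n in gold_standard]
    return kept, evaluable

# Toy counts for illustration:
counts = {"tagger": {"dog": 120, "trousers": 15, "zeugma": 3},
          "chunker": {"dog": 118, "trousers": 12, "zeugma": 2},
          "parser": {"dog": 131, "trousers": 11}}
gold = {"dog": {"countable"}, "zeugma": {"countable"}}
kept, evaluable = filter_nouns(counts, gold)
print(sorted(kept))       # ['dog', 'trousers']
print(sorted(evaluable))  # ['dog']
```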