<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1052"> <Title>Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> A significant amount of work in empirical natural language processing involves developing and refining machine learning techniques to automatically extract linguistic knowledge from on-line text corpora. While the number of learning variants for various problems has been increasing, the size of the training sets such learning algorithms use has remained essentially unchanged.</Paragraph>
<Paragraph position="1"> For instance, for the much-studied problems of part-of-speech tagging, base noun phrase labeling, and parsing, the Penn Treebank, first released in 1992, remains the de facto training corpus. The average training corpus size reported in papers published in the ACL-sponsored Workshop on Very Large Corpora was essentially unchanged from the 1995 proceedings to the 2000 proceedings. While the amount of available on-line text has been growing at an amazing rate over the last five years (by some estimates, there are currently over 500 billion readily accessible words on the web), the size of the training corpora used by our field has remained static.</Paragraph>
<Paragraph position="2"> Confusable word set disambiguation, the problem of choosing the correct use of a word given a set of words with which it is commonly confused (e.g., {to, too, two}, {your, you're}), is a prototypical problem in NLP. At some level, this task is identical to many other natural language problems, including word sense disambiguation, determining lexical features such as pronoun case and determiner number for machine translation, part-of-speech tagging, named entity labeling, spelling correction, and some formulations of skeletal parsing.
All of these problems involve disambiguating among a relatively small set of tokens based on a string context. Of these disambiguation problems, lexical confusables possess the fortunate property that supervised training data is free, since the differences between members of a confusion set are surface-apparent in any sample of well-written text.</Paragraph>
<Paragraph position="3"> To date, all of the papers published on the topic of confusion set disambiguation have used training sets for supervised learning of fewer than one million words. The same is true for most, if not all, of the other disambiguation-in-string-context problems. In this paper we explore what happens when significantly larger training corpora are used. Our results suggest that it may make sense for the field to concentrate considerably more effort on enlarging our training corpora and addressing scalability issues, rather than continuing to explore different learning methods applied to the relatively small extant training corpora.</Paragraph> </Section></Paper>