<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1052">
  <Title>Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing</Title>
  <Section position="4" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2. PREVIOUS WORK
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Confusion Set Disambiguation
</SectionTitle>
      <Paragraph position="0"> Several methods have been presented for confusion set disambiguation. The more recent set of techniques includes multiplicative weight-update algorithms [4], latent semantic analysis [7], transformation-based learning [8], differential grammars [10], decision lists [12], and a variety of Bayesian classifiers [2,3,5]. In all of these papers, the problem is formulated as follows: Given a specific confusion set (e.g. {to, two, too}), all occurrences of confusion set members in the test set are replaced by some marker. Then everywhere the system sees this marker, it must decide which member of the confusion set to choose. Most learners that have been applied to this problem use as features the words and part of speech tags appearing within a fixed window, as well as collocations surrounding the ambiguity site; these are essentially the same features as those used for the other disambiguation-in-string-context problems.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Learning Curves for NLP
</SectionTitle>
      <Paragraph position="0"> A number of learning curve studies have been carried out for different natural language tasks. Ratnaparkhi [12] shows a learning curve for maximum-entropy parsing, for up to roughly one million words of training data; performance appears to be asymptoting when most of the training set is used. Henderson [6] showed similar results across a collection of parsers.</Paragraph>
      <Paragraph position="1"> Figure 1 shows a learning curve we generated for our task of word-confusable disambiguation, in which we plot test classification accuracy as a function of training corpus size using a version of winnow, the best-performing learner reported to date for this well-studied task [4]. This curve was generated by training on successive portions of the 1-million word Brown corpus and then testing on 1-million words of Wall Street Journal text for performance averaged over 10 confusion sets. The curve might lead one to believe that only minor gains are to be had by increasing the size of training corpora past 1 million words.</Paragraph>
      <Paragraph position="2"> While all of these studies indicate that there is likely some (but perhaps limited) performance benefit to be obtained from increasing training set size, they have been carried out only on relatively small training corpora. The potential impact to be felt by increasing the amount of training data by any signifcant order has yet to be studied.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>