<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1052">
  <Title>Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing</Title>
  <Section position="8" start_page="5" end_page="5" type="concl">
    <SectionTitle>
6. CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> We have presented learning curves for a particular natural language disambiguation problem, confusion set disambiguation, training with more than a thousand times more data than had previously been used for this problem. We were able significantly reduce the error rate, compared to the best system trained on the standard training set size, simply by adding more training data.</Paragraph>
    <Paragraph position="1">  We assume an annotated corpus such as the Penn Treebank already exists, and our task is to significantly grow it.</Paragraph>
    <Paragraph position="2"> Therefore, we are only taking into account the marginal cost of additional annotated data, not start-up costs such as style manual design.</Paragraph>
    <Paragraph position="3"> We see that even out to a billion words the learners continue to benefit from additional training data.</Paragraph>
    <Paragraph position="4"> It is worth exploring next whether emphasizing the acquisition of larger training corpora might be the easiest route to improved performance for other natural language problems as well.</Paragraph>
  </Section>
class="xml-element"></Paper>