<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1005">
  <Title>Scaling to Very Very Large Corpora for Natural Language Disambiguation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Machine learning techniques, which automatic ally learn linguistic information from online text corpora, have been applied to a number of natural language problems throughout the last decade. A large percentage of papers published in this area involve comparisons of different learning approaches trained and tested with commonly used corpora.</Paragraph>
    <Paragraph position="1"> While the amount of available online text has been increasing at a dramatic rate, the size of training corpora typically used for learning has not. In part, this is due to the standardization of data sets used within the field, as well as the potentially large cost of annotating data for those learning methods that rely on labeled text. The empirical NLP community has put substantial effort into evaluating performance of a large number of machine learning methods over fixed, and relatively small, data sets. Yet since we now have access to significantly more data, one has to wonder what conclusions that have been drawn on small data sets may carry over when these learning methods are trained using much larger corpora.</Paragraph>
    <Paragraph position="2"> In this paper, we present a study of the effects of data size on machine learning for natural language disambiguation. In particular, we study the problem of selection among confusable words, using orders of magnitude more training data than has ever been applied to this problem. First we show learning curves for four different machine learning algorithms.</Paragraph>
    <Paragraph position="3"> Next, we consider the efficacy of voting, sample selection and partially unsupervised learning with large training corpora, in hopes of being able to obtain the benefits that come from significantly larger training corpora without incurring too large a cost.</Paragraph>
  </Section>
class="xml-element"></Paper>