<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1052">
  <Title>Mitigating the Paucity-of-Data Problem: Exploring the Effect of Training Corpus Size on Classifier Performance for Natural Language Processing</Title>
  <Section position="5" start_page="0" end_page="4" type="metho">
    <SectionTitle>
3. EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> This work attempts to address two questions - at what point will learners cease to benefit from additional data, and what is the nature of the errors which remain at that point. The first question impacts how best to devote resources in order to improve natural language technology. If there is still much to be gained from additional data, we should think hard about ways to effectively increase the available training data for problems of interest. The second question allows us to study failures due to inherent weaknesses in learning methods and features rather than failures due to insufficient data.</Paragraph>
    <Paragraph position="1"> Since annotated training data is essentially free for the problem of confusion set disambiguation, we decided to explore learning curves for this problem for various machine learning algorithms, and then analyze residual errors when the learners are trained on all available data. The learners we used were memory-based learning, winnow, perceptron,  transformation-based learning, and decision trees. All learners used identical features  and were used out-of-the-box, with no parameter tuning. Since our point is not to compare learners we have refrained from identifying the learners in the results below.</Paragraph>
    <Paragraph position="2"> We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. Using this collection, which is three orders of magnitude greater than the largest training corpus previously used for this task, we trained the five learners and tested on a set of 1 million words of Wall Street Journal text.</Paragraph>
    <Paragraph position="3">  In Figure 2 we show learning curves for each learner, for up to one billion words of training data.</Paragraph>
    <Paragraph position="4">  Each point in the graph reflects the average performance of a learner over ten different confusion sets which are listed in Table 1. Interestingly, even out to a billion words, the curves appear to be log-linear. Note that the worst learner trained on approximately 20 million words outperforms the best learner trained on 1 million words. We see that for the problem of confusable disambiguation, none of our learners is close to asymptoting in performance when trained on the one million word training corpus commonly employed within the field.</Paragraph>
    <Paragraph position="5">  {accept, except} {principal, principle} {affect, effect} {then, than} {among, between} {their, there} {its, it's} {weather, whether} {peace, piece} {your, you're}  The graph in Figure 2 demonstrates that for word confusables, we can build a system that considerably outperforms the current best results using an incredibly simplistic learner with just slightly more training data. In the graph, Learner 1 corresponds to a trivial memory-based learner. This learner simply keeps track of</Paragraph>
    <Paragraph position="7"> &gt; counts for all occurrences of the confusables in the training set. Given a test set instance, the learner will first check if it has seen &lt;w</Paragraph>
    <Paragraph position="9"> &gt; in the training set.</Paragraph>
    <Paragraph position="10"> If so, it chooses the confusable word most frequently observed with this tuple. Otherwise, the learner backs off to check for the  We used the standard feature set for this problem. For details see [4].</Paragraph>
    <Paragraph position="11">  The training set contained no text from WSJ.  Learner 5 could not be run on more than 100 million words of training data.</Paragraph>
    <Paragraph position="12"> set member as computed from the training corpus. Note that with 10 million words of training data, this simple learner outperforms all other learners trained on 1 million words. Many papers in empirical natural language processing involve showing that a particular system (only slightly) outperforms others on one of the popular standard tasks. These comparisons are made from very small training corpora, typically less than a million words. We have no reason to believe that any comparative conclusions drawn on one million words will hold when we finally scale up to larger training corpora. For instance, our simple memory based learner, which appears to be among the best performers at a million words, is the worst performer at a billion. The learner that performs the worst on a million words of training data significantly improves with more data.</Paragraph>
    <Paragraph position="13"> Of course, we are fortunate in that labeled training data is easy to locate for confusion set disambiguation. For many natural language tasks, clearly this will not be the case. This reality has sparked interest in methods for combining supervised and unsupervised learning as a way to utilize the relatively small amount of available annotated data along with much larger collections of unannotated data [1,9]. However, it is as yet unclear whether these methods are effective other than in cases where we have relatively small amounts of annotated data available.</Paragraph>
  </Section>
  <Section position="6" start_page="4" end_page="4" type="metho">
    <SectionTitle>
4. RESIDUAL ERRORS
</SectionTitle>
    <Paragraph position="0"> After eliminating errors arising from sparse data and examining the residual errors the learners make when trained on a billion words, we can begin to understand inherent weaknesses in ourlearning algorithms and feature sets. Sparse data problems can always be reduced by buying additional data; the remaining problems truly require technological advances to resolve them.</Paragraph>
    <Paragraph position="1"> We manually examined a sample of errors classifiers made when trained on one billion words and classified them into one of four categories: strongly misleading features, ambiguous context, sparse context and corpus error. In the paragraphs that follow, we define the various error types, and discuss what problems remain even after a substantial decrease in the number of errors attributed to the problem of sparse data.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Strongly Misleading Features
</SectionTitle>
      <Paragraph position="0"> Errors arising from strongly misleading features occur when features which are strongly associated with one class appear in the context of another. For instance, in attempting to characterize the feature set of weather (vs. its commonly-confused set member whether), according to the canonical feature space used for this problem we typically expect terms associated with atmospheric conditions, temperature or natural phenomena to favor use of weather as opposed to whether. Below is an example which illustrates that such strong cues are not always sufficient to accurately disambiguate between these confusables. In such cases, a method for better weighing features based upon their syntactic context, as opposed to using a simple bag-of-words model, may be needed.</Paragraph>
      <Paragraph position="1"> Example: On a sunny day whether she swims or not depends on the temperature of the water.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Ambiguous Context
</SectionTitle>
      <Paragraph position="0"> Errors can also arise from ambiguous contexts. Such errors are made when feature sets derived from shallow local contexts are not sufficient to disambiguate among members of a confusable set. Long-range, complex dependencies, deep semantic understanding or pragmatics may be required in order to draw a distinction among classes. Included in this class of problems are so-called &amp;quot;garden-path&amp;quot; sentences, in which ambiguity causes an incorrect parse of the sentence to be internally constructed by the reader until a certain indicator forces a revision of the sentence structure.</Paragraph>
      <Paragraph position="1">  evaluate weather reports at least four times a day to determine if delivery schedules should be modified.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Sparse Context
</SectionTitle>
      <Paragraph position="0"> Errors can also be a result of sparse contexts. In such cases, an informative term appears, but the term was not seen in the training corpus. Sparse contexts differ from ambiguous contexts in that with more data, such cases are potentially solvable using the current feature set. Sparse context problems may also be lessened by attributing informative lexical features to a word via clustering or other analysis.</Paragraph>
      <Paragraph position="1"> Example: It's baseball's only team-owned spring training site.</Paragraph>
      <Paragraph position="2"> Corpus Error Corpus errors are attributed to cases in which the test corpus contains an incorrect use of a confusable word, resulting in incorrectly evaluating the classification made by a learner. In a well-edited test corpus such as the Wall Street Journal, errors of this nature will be minimal.</Paragraph>
      <Paragraph position="3"> Example: If they don't find oil, its going to be quite a letdown. Table 2 shows the distribution of error types found after learning with a 1-billion-word corpus. Specifically, the sample of errors studied included instances that one particular learner, winnow, incorrectly classified when trained on one billion words. It is interesting that more than half of the errors were attributed to sparse context. Such errors could potentially be corrected were the learner to be trained on an even larger training corpus, or if other methods such as clustering were used.</Paragraph>
      <Paragraph position="4"> The ambiguous context errors are cases in which the feature space currently utilized by the learners is not sufficient for disambiguation; hence, simply adding more data will not help.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="4" end_page="5" type="metho">
    <SectionTitle>
5. A BILLION-WORD TREEBANK?
</SectionTitle>
    <Paragraph position="0"> Our experiments demonstrate that for confusion set disambiguation, system performance improves with more data, up to at least one billion words. Is it feasible to think of ever having a billion-word Treebank to use as training material for tagging, parsing, named entity recognition, and other applications? Perhaps not, but let us run through some numbers.</Paragraph>
    <Paragraph position="1"> To be concrete, assume we want a billion words annotated with part of speech tags at the same level of accuracy as the original million word corpus.</Paragraph>
    <Paragraph position="2">  If we train a tagger on the existing corpus, the naive approach would be to have a person look at every single tag in the corpus, decide whether it is correct, and make a change if it is not. In the extreme, this means somebody has to look at one billion tags. Assume our automatic tagger has an accuracy of 95% and that with reasonable tools, a person can verify at the rate of 5 seconds per tag and correct at the rate of 15 seconds per tag. This works out to an average of 5*.95 + 15*.05 = 5.5 seconds spent per tag, for a total of 1.5 million hours to tag a billion words. Assuming the human tagger incurs a cost of $10/hour, and assuming the annotation takes place after startup costs due to development of an annotation system have been accounted for, we are faced with $15 million in labor costs. Given the cost and labor requirements, this clearly is not feasible. But now assume that we could do perfect error identification, using sample selection techniques. In other words, we could first run a tagger over the billion-word corpus and using sample selection, identify all and only the errors made by the tagger. If the tagger is 95% accurate, we now only have to examine 5% of the corpus, at a correction cost of 15 seconds per tag. This would reduce the labor cost to $2 million for tagging a billion words. Next, assume we had a way of clustering errors such that correcting one tag on average had the effect of correcting 10. This reduces the total labor cost to $200k to annotate a billion words, or $20k to annotate 100 million. Suppose we are off by an order of magnitude; then with the proper technology in place it might cost $200k in labor to annotate 100 million additional words.</Paragraph>
    <Paragraph position="3"> As a result of the hypothetical analysis above, it is not absolutely infeasible to think about manually annotating significantly larger corpora. Given the clear benefit of additional annotated data, we should think seriously about developing tools and algorithms that would allow us to efficiently annotate orders of magnitude more data than what is currently available.</Paragraph>
  </Section>
</Paper>