<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2005">
  <Title>Bagging and Boosting a Treebank Parser</Title>
  <Section position="4" start_page="34" end_page="36" type="metho">
    <SectionTitle>
3 Boosting
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
3.1 Background
</SectionTitle>
      <Paragraph position="0"> The AdaBoost algorithm was presented by Freund and Schapire in 1996 (Freund and Schapire, 1996; Freund and Schapire, 1997) and has become a widely-known successful method in machine learning. The AdaBoost algorithm imposes one constraint on its underlying learner: it may abstain from making predictions about labels of some samples, 1This is the balanced version ofF-measure, where precision and recall are weighted equally.</Paragraph>
      <Paragraph position="1"> but it must consistently be able to get more than 50deg-/o accuracy on the samples for which it commits to a decision. That accuracy is measured according to the distribution describing the importance of samples that it is given. The learner must be able to get more correct samples than incorrect samples by mass of importance on those that it labels. This statement of the restriction comes from Schapire and Singer's study (1998). It is called the weak learning criterion.</Paragraph>
      <Paragraph position="2"> Schapire and Singer (1998) extended AdaBoost by describing how to choose the hypothesis mixing coefficients in certain circumstances and how to incorporate a general notion of confidence scores. They also provided a better characterization of its theoretical performance. The version of AdaBoost used in their work is shown in Algorithm 3, as it is the version that most amenable to parsing.</Paragraph>
      <Paragraph position="3"> Algorithm: AdaBoost (Freund and Schapire, 1997&amp;quot;) (3) Given: Training set /: as in bagging, except yi E {-1, 1 } is the label for example xi. Initial uniform distribution D1 (i) = 1/m. Number of iterations, T.</Paragraph>
      <Paragraph position="4"> Counter t = 1. tI,, C/~, and C/ are as in Bagging.</Paragraph>
      <Paragraph position="5">  1. Create Lt by randomly choosing with replacement m samples from L: using distribution Dt. 2. Classifier induction: Ct ~- ~(Lt) 3. Choose at E IR.</Paragraph>
      <Paragraph position="6"> 4. Adjust and normalize the distribution. Zt is a normalization coefficient.</Paragraph>
      <Paragraph position="7"> 1 D, + , ( i) = -~- Dt ( i ) exp(-c~tYiCt( xi ) ) 5. Increment t. Quit if t &gt; T.</Paragraph>
      <Paragraph position="8"> 6. Repeat from step 1.</Paragraph>
      <Paragraph position="9"> 7. The final hypothesis is</Paragraph>
      <Paragraph position="11"> in order to minimize the expected per-sample training error of the ensemble, which Schapire and Singer show can be concisely expressed by I-\] Zt. They also give several examples for how to pick an appropriate a, and selection generally depends on the possible outputs of the underlying learner.</Paragraph>
      <Paragraph position="12"> Boosting has been used in a few NLP systems.</Paragraph>
      <Paragraph position="13"> Haruno et al. (1998) used boosting to produce more accurate classifiers which were embedded as control  mechanisms of a parser for Japanese. The creators of AdaBoost used it to perform text classification (Schapire and Singer, 2000). Abney et al. (1999) performed part-of-speech tagging and prepositional phrase attachment using AdaBoost as a core component. They found they could achieve accuracies on both tasks that were competitive with the state of the art. As a side effect, they found that inspecting the samples that were consistently given the most weight during boosting revealed some faulty annotations in the corpus. In all of these systems, AdaBoost has been used as a traditional classification system.</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
3.2 Boosting for Parsing
</SectionTitle>
      <Paragraph position="0"> Our goal is to recast boosting for parsing while considering a parsing system as the embedded learner.</Paragraph>
      <Paragraph position="1"> The formulation is given in Algorithm 4. The intuition behind the additive form is that the weight placed on a sentence should be the sum of the weight we would like to place on its constituents. The weight on constituents that are predicted incorrectly are adjusted by a factor of 1 in contrast to a factor of ~ for those that are predicted incorrectly.</Paragraph>
      <Paragraph position="2"> Algorithm: Boosting A Parser (4) Given corpus C with size m = IC I = ~s.~C(s,t) and parser induction algorithm g. Initial uniform distribution Dl(i) = 1/m. Number of iterations, T.</Paragraph>
      <Paragraph position="3">  Counter t = 1.</Paragraph>
      <Paragraph position="4"> 1. Create Ct by randomly choosing with replacement m samples from C using distribution Dr. 2. Create parser ft ~ g(Ct).</Paragraph>
      <Paragraph position="5"> 3. Choose at E R (described below).</Paragraph>
      <Paragraph position="6"> 4. Adjust and normalize the distribution. Zt is  a normalization coefficient. For all i, let parse tree ~-~' ~-- ft(s,). Let ~(T,c) be a function indicating that c is in parse tree r, and ITI is the number of constituents in tree T. T(s) is the set of constituents that are found in the reference or hypothesized annotation for s.</Paragraph>
      <Paragraph position="8"> 5. Increment t. Quit if t &gt; T.</Paragraph>
      <Paragraph position="9"> 6. Repeat from step 1.</Paragraph>
      <Paragraph position="10"> 7. The final hypothesis is computed by combin null ing the individual constituents. Each parser Ct in the ensemble gets a vote with weight at for the constituents they predict. Precisely those constituents with weight strictly larger than 1 ~--~t at are put into the final hypothesis. A potential constituent can be considered correct if it is predicted in the hypothesis and it exists in the reference, or it is not predicted and it is not in the reference. Potential constituents that do not appear in the hypothesis or the reference should not make a big contribution to the accuracy computation. There are many such potential constituents, and if we were maximizing a function that treated getting them incorrect the same as getting a constituent that appears in the reference correct, we would most likely decide not to predict any constituents. null Our model of constituent accuracy is thus simple. Each prediction correctly made over T(s) will be given equal weight. That is, correctly hypothesizing a constituent in the reference will give us one point, but a precision or recall error will cause us to miss one point. Constituent accuracy is then a/(a+b+c), where a is the number of constituents correctly hypothesized, b is the number of precision errors and c is the number of recall errors.</Paragraph>
      <Paragraph position="11"> In Equation 1, a computation of aca as described is shown.</Paragraph>
      <Paragraph position="13"> Boosting algorithms were developed that attempted to maximize F-measure, precision, and recall by varying the computation of a, giving results too numerous to include here. The algorithm given here performed the best of the lot, but was only marginally better for some metrics.</Paragraph>
      <Paragraph position="14"> (1:</Paragraph>
    </Section>
    <Section position="3" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
3.3 Experiment
</SectionTitle>
      <Paragraph position="0"> The experimental results for boosting are shown in Figure 3 and Table 2. There is a large plateau in performance from iterations 5 through 12. Because of their low accuracy and high degree of specialization, the parsers produced in these iterations had little weight during voting and had little effect on the cumulative decision making.</Paragraph>
      <Paragraph position="1"> As in the bagging experiment, it appears that there would be more precision and recall gain to be had by creating a larger ensemble. In both the bagging and boosting experiments time and resource constraints dictated our ensemble size.</Paragraph>
      <Paragraph position="2"> In the table we see that the boosting algorithm equaled bagging's test set gains in precision and recall. The Initial performance for boosting was lower, though. We cannot explain this, and expect it is due to unfortunate resampling of the data during the first iteration of boosting. Exact sentence accuracy, though, was not significantly improved on the test set.</Paragraph>
      <Paragraph position="3"> Overall, we prefer bagging to boosting for this problem when raw performance is the goal. There are side effects of boosting that are useful in other respects, though, which we explore in Section 4.2.</Paragraph>
    </Section>
    <Section position="4" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
3.3.1 Weak Learning Criterion Violations
</SectionTitle>
      <Paragraph position="0"> It was hypothesized in the course of investigating the failures of the boosting algorithm that the parser induction system did not satisfy the weak learning criterion. It was noted that the distribution of boosting weights were more skewed in later iterations. Inspection of the sentences that were getting much mass placed upon them revealed that their weight was being boosted in every iteration. The hypothesis was that the parser was simply unable to learn them.</Paragraph>
      <Paragraph position="1"> 39832 parsers were built to test this, one for each sentence in the training set. Each of these parsers was trained on only a single sentence 2 and evaluated on the same sentence. It was discovered that a full 4764 (11.2%) of these sentences could not be parsed completely correctly by the parsing system.</Paragraph>
      <Paragraph position="2">  In order to evaluate how well boosting worked with a learner that better satisfied the weak learning criterion, the boosting experiment was run again on the Treebank minus the troublesome sentences described above. The results are in Table 3. This dataset produces a larger gain in comparison to the results using the entire Treebank. The initial accuracy, however, is lower. We hypothesize that the boosting algorithm did perform better here, but the parser induction system was learning useful information in those sentences that it could not memorize (e.g. lexical information) that was successfully applied to the test set.</Paragraph>
      <Paragraph position="3"> In this manner we managed to clean our dataset to the point that the parser could learn each sentence in isolation. The corpus-makers cannot necessarily be blamed for the sentences that could not be memorized. All that can be said about those sentences is that for better or worse, the parser's model would not accommodate them.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="36" end_page="37" type="metho">
    <SectionTitle>
4 Corpus Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
4.1 Noisy Corpus: Empirical Investigation
</SectionTitle>
      <Paragraph position="0"> To acquire experimental evidence of noisy data, distributions that were used during boosting the stable corpus were inspected. The distribution was expected to be skewed if there was noise in the data, or be uniform with slight fluctuations if it fit the data well.</Paragraph>
      <Paragraph position="1"> We see how the boosting weight distribution changes in Figure 1. The individual curves are indexed by boosting iteration in the key of the figure. This training run used a corpus of 5000 sentences.</Paragraph>
      <Paragraph position="2"> The sentences are ranked by the weight they are given in the distribution, and sorted in decreasing order by weight along the x-axis. The distribution was smoothed by putting samples into equal weight bins, and reporting the average mass of samples in the bin as the y-coordinate. Each curve on this graph corresponds to a boosting iteration. We used 1000 bins for this graph, and a log scale on the x-axis. Since there were 5000 samples, all samples initially had a y-value of 0.0002.</Paragraph>
      <Paragraph position="3">  Notice first that the left endpoints of the lines move from bottom to top in order of boosting iteration. The distribution becomes monotonically more skewed as boosting progresses. Secondly we see by the last iteration that most of the weight is concentrated on less than 100 samples. This graph shows behavior consistent with noise in the corpus on which the boosting algorithm is focusing.</Paragraph>
    </Section>
    <Section position="2" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
4.2 Treebank Inconsistencies
</SectionTitle>
      <Paragraph position="0"> There are sentences in the corpus that can be learned by the parser induction algorithm in isolation but not in concert because they contain conflicting information. Finding these sentences leads to a better understanding of the quality of our corpus, and gives an idea for where improvements in annotation quality can be made. Abney et al. (1999) showed a similar corpus analysis technique for part of speech tagging and prepositional phrase tagging, but for parsing we must remove errors introduced by the parser as we did in Section 3.3.2 before questioning the corpus quality. A particular class of errors, inconsistencies, can then be investigated. Inconsistent annotations are those that appear plausible in isolation, but which conflict with annotation decisions made elsewhere in the corpus.</Paragraph>
      <Paragraph position="1"> In Figure 5 we show a set of trees selected from within the top 100 most heavily weighted trees at the end of 15 iterations of boosting the stable corpus.Collins's parser induction system is able to learn to produce any one of these structures in isolation, but the presence of conflicting information in different sentences prevents it from achieving 100% accuracy on the set.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="37" end_page="39832" type="metho">
    <SectionTitle>
5 Training Corpus Size Effects
</SectionTitle>
    <Paragraph position="0"> We suspect our best parser diversification techniques gives performance gain approximately equal to doubling the size of the training set. While this cannot be directly tested without hiring more annotators, an expected performance bound for a larger training set can be produced by extrapolating from how well the parser performs using smaller training sets.</Paragraph>
    <Paragraph position="1"> There are two characteristics of training curves for large corpora that can provide such a bound: training curves generally increase monotonically in the absence of over-training, and their first derivatives generally decrease monotonically.</Paragraph>
    <Paragraph position="2">  The training curves we present in Figure 4 and Table 4 suggest that roughly doubling the corpus size  in the range of interest (between 10000 and 40000 sentences) gives a test set F-measure gain of approximately 0.70.</Paragraph>
    <Paragraph position="3"> Bagging achieved significant gains of approximately 0.60 over the best reported previous F-measure without adding any new data. In this respect, these techniques show promise for making performance gains on large corpora without adding more data or new parsers.</Paragraph>
  </Section>
class="xml-element"></Paper>