<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0606">
  <Title>Boosting Applied to Tagging and PP Attachment</Title>
  <Section position="3" start_page="0" end_page="40" type="metho">
    <SectionTitle>
2 The boosting algorithm AdaBoost
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the boosting algorithm AdaBoost that we used in our experiments.</Paragraph>
    <Paragraph position="1"> AdaBoost was first introduced by Freund and Schapire (1997); the version described here is a (slightly simplified) version of the one given by Schapire and Singer (1998). A formal description of AdaBoost is shown in Figure 1. AdaBoost takes as input a training set of m labeled examples ((xl,yl),..., (Xrn, Ym)) where xi is an example (say, as described by a vector of attribute values), and Yi E {-1, -l--l} is the label associated with xi. For now, we focus on the binary case, in which only two labels (positive or negative) are possible.</Paragraph>
    <Paragraph position="2"> Multiclass problems are discussed later.</Paragraph>
    <Paragraph position="3"> Formally, the rules of thumb mentioned in the introduction are called weak hypotheses. Boosting assumes access to an algorithm or subroutine for generating weak hypotheses called the weak learner. Boosting can be combined with any suitable weak learner; the one that we used will be described below.</Paragraph>
    <Paragraph position="4"> AdaBoost calls the weak learner repeatedly in a series of rounds. On round t, AdaBoost provides the weak learner with a set of importance weights over the training set. In response, the weak learner com- null Given: (xl, yl),..., (Xm, Ym) where xi E X, Yi E {-1, +1}</Paragraph>
    <Paragraph position="6"> where Zt is a normalization factor (chosen so that Dt+l will be a distribution).</Paragraph>
    <Paragraph position="7"> Output the final hypothesis:  putes a weak hypothesis ht that maps each example x to a real number ht(x). The sign of this number is interpreted as the predicted class (-1 or +1) of example z, while the magnitude \]ht(z)\] is interpreted as the level of confidence in the prediction, with larger values corresponding to more confident predictions.</Paragraph>
    <Paragraph position="8"> The importance weights are maintained formally as a distribution over the training set. We write Dr(i) to denote the weight of the ith training example (xi, Yi) on the tth round of boosting. Initially, the distribution is uniform. Having obtained a hypothesis ht from the weak learner, AdaBoost updates the weights by multiplying the weight of each example i by I e -ylht(xi). If ht incorrectly classified example i so that ht (xi) and Yi disagree in sign, then this has the effect of increasing the weight on this example, and conversely the weights of correctly classified examples are decreased. Moreover, the greater the confidence of the prediction (that is, the greater the magnitude of ht(xi) ), the more drastic will be the effect of the update. The weights are then renormalized, resulting in the update rule shown in the figure.</Paragraph>
    <Paragraph position="9"> In our experiments, we used cross validation to choose the number of rounds T. After T rounds, JSchapire and Singer (1998) multiply instead by exp(-yioetht(xi)) where at E ~ is a parameter that needs to be set. In the description presented here, we fold at into ht. AdaBoost outputs a final hypothesis which makes predictions using a simple vote of the weak hypotheses' predictions, taking into account the varying confidences of the different predictions. A new example x is classified using</Paragraph>
    <Paragraph position="11"> where the label predicted for x is sign(ff(x)).</Paragraph>
    <Section position="1" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
2.1 Finding weak hypotheses
</SectionTitle>
      <Paragraph position="0"> In this section, we describe the weak learner used in our experiments. Since we now focus on what happens on a single round of boosting, we will drop t subscripts where possible.</Paragraph>
      <Paragraph position="1"> Schapire and Singer (1998) prove that the training error of the final hypothesis is at most yItr=l Zt. This suggests that the training error can be greedily driven down by designing a weak learner which, on round t of boosting, attempts to find a weak hypothesis h that minimizes</Paragraph>
      <Paragraph position="3"> This is the principle behind the weak learner used in our experiments.</Paragraph>
      <Paragraph position="4"> In all our experiments, we use very simple weak hypotheses that test the value of a Boolean predicate and make a prediction based on that value. The predicates used are of the form &amp;quot;a = v&amp;quot;, for a an attribute and v a value; for example, &amp;quot;PreviousWord = the&amp;quot;. In the PP-attachment experiments, we also considered conjunctions of such predicates. If, on a given example x, the predicate holds, the weak hypothesis outputs prediction Pl, otherwise P0, where Pl and P0 are determined by the training data in a way we describe shortly. In this setting, weak hypotheses can be identified with predicates, which in turn can be thought of as features of the examples; thus, in this setting, boosting can be viewed as a feature-selection method.</Paragraph>
      <Paragraph position="5"> Let C/(z) E {0, 1} denote the value of the predicate C/ on the example z, and for b E {0, 1}, let Pb E IR be the prediction of the weak hypothesis when C/(x) = b. Then we can write simply h(x) = PC(z). Given a predicate C/, we choose P0 and Pl to minimize Z. Schapire and Singer (1998) show that Z is minimized when we let</Paragraph>
      <Paragraph position="7"> in the literature. B = (Brill and Wu, 1998); M = (Magerman, 1995); O = our data; R = (Ratnaparkhi, 1996); W = (Weischedel and others, 1993). forb E {0,1} where Ws bisthesum of D(i) for examples i such that yi = s and C/(xi) = b. This choice of p# implies that</Paragraph>
      <Paragraph position="9"> This expression can now be minimized over all choices of C/.</Paragraph>
      <Paragraph position="10"> Thus, our weak learner works by searching for the predicate C/ that minimizes Z of Eq. (2), and the resulting weak hypothesis h(x) predicts Pc(z) of Eq. (1) on example x.</Paragraph>
      <Paragraph position="11"> In practice, very large values of p0 and pl can cause numerical problems and may also lead to overfitting. Therefore, we usually &amp;quot;smooth&amp;quot; these values using the following alternate choice of Pb given by Schapire and Singer (1998): (W+ba a t-q'-~) pb = 1/2 In \~-~ (3) where e is a small positive number.</Paragraph>
    </Section>
    <Section position="2" start_page="39" end_page="40" type="sub_section">
      <SectionTitle>
2.2 Multiclass problems
</SectionTitle>
      <Paragraph position="0"> So far, we have only discussed binary classification problems. In the multiclass case (in which more than two labels are possible), there are many possible extensions of AdaBoost (Freund and Schapire, 1997; Schapire, 1997; Schapire and Singer, 1998).</Paragraph>
      <Paragraph position="1"> Our default approach to multiclass problems is to use Schapire and Singer's (1998) AdaBoost.MH algorithm. The main idea of this algorithm is to regard each example with its multiclass label as several binary-labeled examples.</Paragraph>
      <Paragraph position="2"> More precisely, suppose that the possible classes are 1,...,k. We map each original example x  with label y to k binary labeled derived examples</Paragraph>
      <Paragraph position="4"> lem. We maintain a distribution over pairs (x, c), treating each such as a separate example. Weak hypotheses are identified with predicates over (x, c) pairs, though they now ignore c, so that we can continue to use the same space of predicates as before. The prediction weights c c P0, Pl, however, are chosen separately for each class c; we have ht(x, c) = P~,(z)&amp;quot; Given a new example x, the final hypothesis makes confidence-weighted predictions</Paragraph>
      <Paragraph position="6"> tion questions (c = 1? c = 2? etc.); the class is predicted to be the value of c that maximizes f(x, c).</Paragraph>
      <Paragraph position="7"> For more detail, see the original paper (Schapire and Singer, 1998).</Paragraph>
      <Paragraph position="8"> When memory limitations prevent the use of AdaBoost.MH, an alternative we have pursued is to use binary AdaBoost to train separate discriminators (binary classifiers) for each class, and combine their output by choosing the class c that maximizes re(x), where fc(x) is the final confidence-weighted prediction of the discriminator for class c. Let us call this algorithm AdaBoost.MI (multiclass, independent discriminators). It differs from AdaBoost.MH in that predicates are selected independently for each class; we do not require that the weak hypothesis at round t be the same for all classes. The number of rounds may also differ from discriminator to discriminator.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="40" end_page="42" type="metho">
    <SectionTitle>
3 Tagging
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
3.1 Corpus
</SectionTitle>
      <Paragraph position="0"> To facilitate comparison with previous results, we used the UPenn Treebank corpus (Marcus et al., 1993). The corpus uses 80 labels, which comprise 45 parts of speech properly so-called, and 35 indeterminate tags, representing annotator uncertainty.</Paragraph>
      <Paragraph position="1"> We introduce an 81 st label, ##, for paragraph separators. null An example of an indeterminate tag is NNIO'd, which indicates that the annotator could not decide between NN and ,30. The &amp;quot;right&amp;quot; thing to do with indeterminate tags would either be to eliminate them or to count the tagger's output as correct if it agrees with any of the alternatives. Previous work appears to treat them as separate tags, however, and we have followed that precedent.</Paragraph>
      <Paragraph position="2"> We partitioned the corpus into three samples: a test sample consisting of 1000 randomly selected  paragraphs (54,194 tokens), a development sample, also of 1000 paragraphs (52,087 tokens), and a training sample' (1,207,870 tokens).</Paragraph>
      <Paragraph position="3"> Some previously reported results on the Treebank corpus are summarized in Table 1. These results are all based on the Treebank corpus, but it appears that they do not all use the same training-test split, nor the same preprocessing, hence there may be differences in details of examples and labels. The &amp;quot;MF tag&amp;quot; method simply uses the most-frequent tag from training as the predicted label. The voting scheme combines the outputs of four other taggers.</Paragraph>
    </Section>
    <Section position="2" start_page="40" end_page="42" type="sub_section">
      <SectionTitle>
3.2 Applying Boosting to Tagging
</SectionTitle>
      <Paragraph position="0"> The straightforward way of applying boosting to tagging is to use AdaBoost.MH. Each word token represents an example, and the classes are the 81 part-of-speech tags. Weak hypotheses are identified with &amp;quot;attribute=value&amp;quot; predicates. We use a rather spare attribute set, encoding less context than  is usual. The attributes we use are: * Lexical attributes: The current word as a downcased string (S); its capitalization (C); and its most-frequent tag in training (T). T is unknown for unknown words.</Paragraph>
      <Paragraph position="1"> * Contextual attributes: the string (LS), capi null talization (LC), and most-frequent tag (LT) of the preceding word; and similarly for the following word (RS, RC, RT).</Paragraph>
      <Paragraph position="2"> * Morphological attributes: the inflectional suffix (I) of the current word, as provided by an automatic stemmer; also the last two ($2) and last three ($3) letters of the current word. We note in passing that the single attribute T is a good predictor of the correct label; using T as the predicted label gives a 7.7% error rate (see Table 1). Experiment 1. Because of memory limitations, we could not apply AdaBoost.MH to the entire training sample. We examined several approximations. The simplest approximation (experiment 1) is to run AdaBoost.MH on 400K training examples,  instead of the full training set. Doing so yields a test error of 3.68%, which is actually as good as using Markov 3-grams (Table 1).</Paragraph>
      <Paragraph position="3"> Experiment 2. In experiment 2, we divided the training data into four quarters, trained a classifier using AdaBoost.MH on each quarter, and combined the four classifiers using (loosely speaking) a final round of boosting. This improved test error significantly, to 3.32%. In fact, this tagger performs as well as any single tagger in Table 1 except the Max-ent tagger.</Paragraph>
      <Paragraph position="4"> Experiment 3. In experiment 3, we reduced the training sample by eliminating unambiguous words (multiple tags attested in training) and indefinite tags. We examined all indefinite-tagged examples and made a forced choice among the alternatives. The result is not strictly comparable to results on the larger tagset, but since only 5 out of 54K test examples are affected, the difference is negligible. This yielded a multiclass problem with 648K examples and 39 classes. We constructed a separate classifier for unknown words, using AdaBoost.MH. We used hapax legomena (words appearing once) from our training sample to train it. The error rate on unknown words was 19.1%. The overall test error rate was 3.59%, intermediate between the error rates in the two previous experiments.</Paragraph>
      <Paragraph position="5"> Experiment 4. One obvious way of reducing the training data would be to train a separate classifier for each word. However, that approach would result in extreme data fragmentation. An alternative is to cut the data in the other direction, and build a separate discriminator for each part of speech, and  function of the number of rounds of boosting for the PP-attachment problem.</Paragraph>
      <Paragraph position="6"> combine them by choosing the part of speech whose discriminator predicts 'Yes' with the most confidence (or 'No' with the least confidence). We took this approach--algorithm AdaBoost.MI--in experiment 4. To choose the appropriate number of rounds for each discriminator, we did an initial run, and chose the point at which error on the development sample flattened out. To handle unknown words, we used the same unknown-word classifier as in experiment 3.</Paragraph>
      <Paragraph position="7"> The result was the best for any of our experiments: a test error rate of 3.28%, slightly better than experiment 2. The 3.28% error rate is not significantly different (at p = 0.05) from the error rate of the best-known single tagger, Ratnaparkhi's Maxent tagger, which achieves 3.11% error on our data.</Paragraph>
      <Paragraph position="8"> Our results are not as good as those achieved by Brill and Wu's voting scheme. The experiments we describe here use very simple features, like those used in the Maxent or transformation-based taggers; hence the results are not comparable to the multipletagger voting scheme. We are optimistic that boosting would do well with tagger predictions as input features, but those experiments remain to be done. Table 2 breaks out the error sources for experiment 4. Table 3 sums up the results of all four experiments. null Experiment 5 (Sequential model). To this point, tagging decisions are made based on local context only. One would expect performance to improve if we consider a Viterbi-style optimization to choose a globally best sequence of labels. Using decision sequences also permits one to use true tags, rather  than most-frequent tags, on context tokens. We did a fixed 500 rounds of boosting, testing against the development sample. Surprisingly, the sequential model performed much less well than the localdecision models. The results are summarized in Table 4.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML