Boosting Applied to Tagging and PP Attachment

Steven Abney    Robert E. Schapire    Yoram Singer
AT&T Labs - Research
180 Park Avenue
Florham Park, NJ 07932
{abney, schapire, singer}@research.att.com
Abstract 
Boosting is a machine learning algorithm that is not 
well known in computational linguistics. We ap- 
ply it to part-of-speech tagging and prepositional 
phrase attachment. Performance is very encourag- 
ing. We also show how to improve data quality by 
using boosting to identify annotation errors. 
1 Introduction 
Boosting is a machine learning algorithm that has 
been applied successfully to a variety of problems, 
but is almost unknown in computational linguis- 
tics. We describe experiments in which we apply 
boosting to part-of-speech tagging and prepositional 
phrase attachment. Results on both PP-attachment 
and tagging are within sampling error of the best 
previous results. 
The current best technique for PP-attachment 
(backed-off density estimation) does not perform 
well for tagging, and the current best technique for 
tagging (maxent) is below state-of-the-art on PP- 
attachment. Boosting achieves state-of-the-art per- 
formance on both tasks simultaneously. 
The idea of boosting is to combine many simple
"rules of thumb," such as "the current word is
a noun if the previous word is the." Such rules
individually often give incorrect classifications;
boosting combines many such rules in a principled
manner to produce a single highly accurate
classification rule.
There are similarities between boosting and 
transformation-based learning (Brill, 1993): both 
build classifiers by combining simple rules, and 
both are noted for their resistance to overfitting. 
But boosting, unlike transformation-based learning, 
rests on firm theoretical foundations; and it outper- 
forms transformation-based learning in our experi- 
ments. 
There are also superficial similarities between 
boosting and maxent. In both, the parameters are 
weights in a log-linear function. But in maxent, the 
log-linear function defines a probability, and the ob- 
jective is to maximize likelihood, which may not 
minimize classification error. In boosting, the log- 
linear function defines a hyperplane dividing exam- 
ples into (binary) classes, and boosting minimizes 
classification error directly. Hence boosting is usu- 
ally more appropriate when the objective is classifi- 
cation rather than density estimation. 
A notable property of boosting is that it maintains 
an explicit measure of how difficult it finds partic- 
ular training examples to be. The most difficult ex- 
amples are very often mislabelled examples. Hence, 
boosting can contribute to improving data quality by 
identifying annotation errors. 
2 The boosting algorithm AdaBoost 
In this section, we describe the boosting algo- 
rithm AdaBoost that we used in our experiments. 
AdaBoost was first introduced by Freund and 
Schapire (1997); the version described here is a 
(slightly simplified) version of the one given by 
Schapire and Singer (1998). A formal descrip- 
tion of AdaBoost is shown in Figure 1. AdaBoost 
takes as input a training set of $m$ labeled examples
$(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i$ is an example
(say, as described by a vector of attribute values),
and $y_i \in \{-1, +1\}$ is the label associated with
$x_i$. For now, we focus on the binary case, in which
only two labels (positive or negative) are possible. 
Multiclass problems are discussed later. 
Formally, the rules of thumb mentioned in the 
introduction are called weak hypotheses. Boost- 
ing assumes access to an algorithm or subroutine 
for generating weak hypotheses called the weak 
learner. Boosting can be combined with any suit- 
able weak learner; the one that we used will be de- 
scribed below. 
AdaBoost calls the weak learner repeatedly in a 
series of rounds. On round t, AdaBoost provides the 
weak learner with a set of importance weights over 
the training set. In response, the weak learner com- 
Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
• Train weak learner using distribution $D_t$.
• Get weak hypothesis $h_t : X \to \mathbb{R}$.
• Update:
$$D_{t+1}(i) = \frac{D_t(i)\exp(-y_i h_t(x_i))}{Z_t}$$
where $Z_t$ is a normalization factor (chosen so
that $D_{t+1}$ will be a distribution).
Output the final hypothesis: $f(x) = \sum_{t=1}^{T} h_t(x)$.

Figure 1: The boosting algorithm AdaBoost.
putes a weak hypothesis $h_t$ that maps each example
$x$ to a real number $h_t(x)$. The sign of this number
is interpreted as the predicted class ($-1$ or $+1$)
of example $x$, while the magnitude $|h_t(x)|$ is interpreted
as the level of confidence in the prediction,
with larger values corresponding to more confident
predictions.
The importance weights are maintained formally
as a distribution over the training set. We write
$D_t(i)$ to denote the weight of the $i$th training example
$(x_i, y_i)$ on the $t$th round of boosting. Initially,
the distribution is uniform. Having obtained a
hypothesis $h_t$ from the weak learner, AdaBoost updates
the weights by multiplying the weight of each
example $i$ by $e^{-y_i h_t(x_i)}$. (Schapire and Singer
(1998) multiply instead by $\exp(-y_i \alpha_t h_t(x_i))$,
where $\alpha_t \in \mathbb{R}$ is a parameter that needs to be set;
in the description presented here, we fold $\alpha_t$ into
$h_t$.) If $h_t$ incorrectly classified example $i$, so that
$h_t(x_i)$ and $y_i$ disagree in sign, then the update
increases the weight on this example; conversely,
the weights of correctly classified examples are
decreased. Moreover, the greater the confidence of
the prediction (that is, the greater the magnitude of
$h_t(x_i)$), the more drastic the effect of the update
will be. The weights are then renormalized, resulting
in the update rule shown in the figure.
In our experiments, we used cross validation to
choose the number of rounds $T$. After $T$ rounds,
AdaBoost outputs a final hypothesis which makes
predictions using a simple vote of the weak hypotheses'
predictions, taking into account the varying
confidences of the different predictions. A new
example $x$ is classified using

$$f(x) = \sum_{t=1}^{T} h_t(x)$$

where the label predicted for $x$ is $\mathrm{sign}(f(x))$.
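As a concrete illustration, here is a minimal sketch (ours, not the authors' implementation) of this training loop and final vote in Python. The weak_learner argument stands for any routine that, given the current distribution, returns a confidence-rated hypothesis; one such routine is sketched in section 2.1.

    import math

    def adaboost(examples, labels, weak_learner, T):
        """Confidence-rated AdaBoost, as in Figure 1 (simplified)."""
        m = len(examples)
        D = [1.0 / m] * m                       # D_1: uniform distribution
        hypotheses = []
        for t in range(T):
            h = weak_learner(examples, labels, D)   # train on distribution D_t
            hypotheses.append(h)
            # Multiply each weight by exp(-y_i h(x_i)); mistakes gain weight,
            # correctly classified examples lose weight.
            D = [D[i] * math.exp(-labels[i] * h(examples[i])) for i in range(m)]
            Z = sum(D)                          # normalization factor Z_t
            D = [d / Z for d in D]
        # Final hypothesis f(x); the predicted label is sign(f(x)).
        return lambda x: sum(h(x) for h in hypotheses)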
2.1 Finding weak hypotheses 
In this section, we describe the weak learner used 
in our experiments. Since we now focus on what 
happens on a single round of boosting, we will drop 
t subscripts where possible. 
Schapire and Singer (1998) prove that the training
error of the final hypothesis is at most $\prod_{t=1}^{T} Z_t$.
This suggests that the training error can be greedily
driven down by designing a weak learner which, on
round $t$ of boosting, attempts to find a weak hypothesis
$h$ that minimizes

$$Z = \sum_{i=1}^{m} D(i) \exp(-y_i h(x_i)).$$
This is the principle behind the weak learner used in 
our experiments. 
In all our experiments, we use very simple weak 
hypotheses that test the value of a Boolean predi- 
cate and make a prediction based on that value. The 
predicates used are of the form "a = v", for a an 
attribute and v a value; for example, "PreviousWord 
= the". In the PP-attachment experiments, we also 
considered conjunctions of such predicates. If, on a 
given example $x$, the predicate holds, the weak hypothesis
outputs prediction $p_1$; otherwise it outputs $p_0$, where
$p_1$ and $p_0$ are determined by the training data in a
way we describe shortly. In this setting, weak hy-
potheses can be identified with predicates, which in 
turn can be thought of as features of the examples; 
thus, in this setting, boosting can be viewed as a 
feature-selection method. 
Let $\phi(x) \in \{0, 1\}$ denote the value of the predicate
$\phi$ on the example $x$, and for $b \in \{0, 1\}$, let
$p_b \in \mathbb{R}$ be the prediction of the weak hypothesis
when $\phi(x) = b$. Then we can write simply
$h(x) = p_{\phi(x)}$. Given a predicate $\phi$, we choose $p_0$
and $p_1$ to minimize $Z$. Schapire and Singer (1998)
show that $Z$ is minimized when we let

$$p_b = \frac{1}{2} \ln\left(\frac{W_+^b}{W_-^b}\right) \qquad (1)$$
Method                Source   Error (%)
MF tag                  O      7.66
Markov 1-gram           B      6.74
Markov 3-gram           W      3.7
Markov 3-gram           B      3.64
Decision tree           M      3.5
Transformation          B      3.39
Maxent                  R      3.37
Maxent                  O      3.11 ± .07
Multi-tagger Voting     B      2.84 ± .03

Table 1: TB-WSJ testing error previously reported
in the literature. B = (Brill and Wu, 1998); M
= (Magerman, 1995); O = our data; R = (Ratnaparkhi,
1996); W = (Weischedel et al., 1993).
for $b \in \{0, 1\}$, where $W_s^b$ is the sum of $D(i)$ over
examples $i$ such that $y_i = s$ and $\phi(x_i) = b$. This
choice of $p_b$ implies that

$$Z = 2 \sum_{b \in \{0,1\}} \sqrt{W_+^b W_-^b}. \qquad (2)$$

This expression can now be minimized over all
choices of $\phi$.
Thus, our weak learner works by searching for
the predicate $\phi$ that minimizes $Z$ of Eq. (2), and
the resulting weak hypothesis $h(x)$ predicts $p_{\phi(x)}$
of Eq. (1) on example $x$.
In practice, very large values of $p_0$ and $p_1$ can
cause numerical problems and may also lead to
overfitting. Therefore, we usually "smooth" these
values using the following alternate choice of $p_b$
given by Schapire and Singer (1998):

$$p_b = \frac{1}{2} \ln\left(\frac{W_+^b + \epsilon}{W_-^b + \epsilon}\right) \qquad (3)$$

where $\epsilon$ is a small positive number.
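To make this concrete, here is a minimal sketch (ours) of such a weak learner for single "attribute=value" predicates, with each example represented as a dict of attribute-value pairs; it scores every candidate predicate by Eq. (2) and returns the smoothed hypothesis of Eqs. (1) and (3). It matches the weak_learner interface of the sketch in section 2.

    import math
    from collections import defaultdict

    def best_predicate(examples, labels, D, eps=0.001):
        """Pick the predicate phi minimizing Z (Eq. 2); return h(x) = p_phi(x)."""
        # W[pred][b][s]: weight of examples with phi(x)=b and label s
        # (s = 0 for y = -1, s = 1 for y = +1).
        W = defaultdict(lambda: [[0.0, 0.0], [0.0, 0.0]])
        total = [0.0, 0.0]
        for x, y, d in zip(examples, labels, D):
            s = 0 if y == -1 else 1
            total[s] += d
            for pred in x.items():          # predicates true of this example
                W[pred][1][s] += d
        best_pred, best_Z = None, float('inf')
        for pred, Wp in W.items():
            # Weight with phi(x) = 0 is the remainder of the total weight.
            Wp[0][0] = total[0] - Wp[1][0]
            Wp[0][1] = total[1] - Wp[1][1]
            Z = 2 * sum(math.sqrt(Wp[b][1] * Wp[b][0]) for b in (0, 1))
            if Z < best_Z:
                best_pred, best_Z = pred, Z
        Wp = W[best_pred]
        # Smoothed predictions p_0, p_1 of Eq. (3).
        p = [0.5 * math.log((Wp[b][1] + eps) / (Wp[b][0] + eps)) for b in (0, 1)]
        attr, val = best_pred
        return lambda x: p[1] if x.get(attr) == val else p[0]

A classifier is then trained with, e.g., adaboost(examples, labels, best_predicate, T).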
2.2 Multiclass problems 
So far, we have only discussed binary classification 
problems. In the multiclass case (in which more 
than two labels are possible), there are many pos- 
sible extensions of AdaBoost (Freund and Schapire, 
1997; Schapire, 1997; Schapire and Singer, 1998). 
Our default approach to multiclass problems is to 
use Schapire and Singer's (1998) AdaBoost.MH al- 
gorithm. The main idea of this algorithm is to re- 
gard each example with its multiclass label as sev- 
eral binary-labeled examples. 
More precisely, suppose that the possible classes
are $1, \ldots, k$. We map each original example $x$
with label $y$ to $k$ binary-labeled derived examples
$(x, 1), \ldots, (x, k)$, where example $(x, c)$ is labeled
$+1$ if $c = y$ and $-1$ otherwise. We then essentially
apply binary AdaBoost to this derived problem.
We maintain a distribution over pairs $(x, c)$,
treating each such pair as a separate example. Weak
hypotheses are identified with predicates over $(x, c)$
pairs, though they now ignore $c$, so that we can
continue to use the same space of predicates as
before. The prediction weights $p_0^c, p_1^c$, however,
are chosen separately for each class $c$; we have
$h_t(x, c) = p^c_{\phi(x)}$. Given a new example $x$, the final
hypothesis makes confidence-weighted predictions
$f(x, c) = \sum_{t=1}^{T} h_t(x, c)$ for each of the discrimination
questions ($c = 1$? $c = 2$? etc.); the class is predicted
to be the value of $c$ that maximizes $f(x, c)$.
For more detail, see the original paper (Schapire and
Singer, 1998).
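A minimal sketch (ours) of this reduction, reusing the adaboost routine above; it glosses over the per-class choice of prediction weights and assumes a weak learner that accepts the derived $(x, c)$ pairs:

    def adaboost_mh(examples, labels, classes, weak_learner, T):
        """AdaBoost.MH via the one-binary-example-per-(x, c) reduction."""
        derived_x, derived_y = [], []
        for x, y in zip(examples, labels):
            for c in classes:
                derived_x.append((x, c))               # derived example
                derived_y.append(1 if c == y else -1)  # +1 iff c is the label
        f = adaboost(derived_x, derived_y, weak_learner, T)
        # Predict the class c maximizing the confidence-weighted vote f(x, c).
        return lambda x: max(classes, key=lambda c: f((x, c)))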
When memory limitations prevent the use of Ad- 
aBoost.MH, an alternative we have pursued is to 
use binary AdaBoost to train separate discrimina- 
tors (binary classifiers) for each class, and com- 
bine their outputs by choosing the class $c$ that maximizes
$f_c(x)$, where $f_c(x)$ is the final confidence-weighted
prediction of the discriminator for class
$c$. Let us call this algorithm AdaBoost.MI (multiclass,
independent discriminators). It differs from
AdaBoost.MH in that predicates are selected inde- 
pendently for each class; we do not require that 
the weak hypothesis at round t be the same for all 
classes. The number of rounds may also differ from 
discriminator to discriminator. 
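Under the same conventions, AdaBoost.MI amounts to the following sketch (ours), where rounds maps each class to its own number of boosting rounds:

    def adaboost_mi(examples, labels, classes, weak_learner, rounds):
        """AdaBoost.MI: independent binary discriminators, combined by argmax."""
        f = {c: adaboost(examples,
                         [1 if y == c else -1 for y in labels],
                         weak_learner, rounds[c])      # rounds may differ per class
             for c in classes}
        # Choose the class whose discriminator is most confident.
        return lambda x: max(classes, key=lambda c: f[c](x))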
3 Tagging 
3.1 Corpus 
To facilitate comparison with previous results, we 
used the UPenn Treebank corpus (Marcus et al., 
1993). The corpus uses 80 labels, which comprise 
45 parts of speech properly so-called, and 35 inde- 
terminate tags, representing annotator uncertainty. 
We introduce an 81st label, ##, for paragraph sepa-
rators. 
An example of an indeterminate tag is NN|JJ,
which indicates that the annotator could not decide
between NN and JJ. The "right" thing to do with in-
determinate tags would either be to eliminate them 
or to count the tagger's output as correct if it agrees 
with any of the alternatives. Previous work appears 
to treat them as separate tags, however, and we have 
followed that precedent. 
We partitioned the corpus into three samples: a 
test sample consisting of 1000 randomly selected 
             n               errors   percent   contrib
ambig        28,557 (52.7%)   1396      4.89      2.58
unambig      24,533 (45.3%)    167      0.68      0.31
unknown       1,104  (2.0%)    213     19.29      0.39
total        54,194            1776     3.28 ± 0.08

Table 2: Performance of the multi-discriminator approach.
paragraphs (54,194 tokens), a development sam- 
ple, also of 1000 paragraphs (52,087 tokens), and 
a training sample (1,207,870 tokens).
Some previously reported results on the Treebank 
corpus are summarized in Table 1. These results are 
all based on the Treebank corpus, but it appears that 
they do not all use the same training-test split, nor 
the same preprocessing, hence there may be differ- 
ences in details of examples and labels. The "MF 
tag" method simply uses the most-frequent tag from 
training as the predicted label. The voting scheme 
combines the outputs of four other taggers. 
3.2 Applying Boosting to Tagging 
The straightforward way of applying boosting to 
tagging is to use AdaBoost.MH. Each word token 
represents an example, and the classes are the 81 
part-of-speech tags. Weak hypotheses are identi- 
fied with "attribute=value" predicates. We use a 
rather spare attribute set, encoding less context than 
is usual. The attributes we use are: 
• Lexical attributes: The current word as a 
downcased string (S); its capitalization (C); 
and its most-frequent tag in training (T). T is 
unknown for unknown words. 
• Contextual attributes: the string (LS), capi- 
talization (LC), and most-frequent tag (LT) of 
the preceding word; and similarly for the fol- 
lowing word (RS, RC, RT). 
• Morphological attributes: the inflectional 
suffix (I) of the current word, as provided by 
an automatic stemmer; also the last two ($2) 
and last three ($3) letters of the current word. 
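For concreteness, a token might be encoded as follows (a sketch of ours; most_frequent_tag and inflectional_suffix are assumed helper functions, the former returning None for unknown words):

    def encode_token(words, i, most_frequent_tag, inflectional_suffix):
        """Encode token i of a sentence using the attributes listed above."""
        def cap(s):
            return s[:1].isupper()
        w = words[i]
        attrs = {
            'S': w.lower(),                  # downcased string
            'C': cap(w),                     # capitalization
            'T': most_frequent_tag(w),       # most-frequent tag; None if unknown
            'I': inflectional_suffix(w),     # from an automatic stemmer
            'S2': w[-2:],                    # last two letters
            'S3': w[-3:],                    # last three letters
        }
        if i > 0:                            # preceding-word attributes
            prev = words[i - 1]
            attrs.update(LS=prev.lower(), LC=cap(prev), LT=most_frequent_tag(prev))
        if i + 1 < len(words):               # following-word attributes
            nxt = words[i + 1]
            attrs.update(RS=nxt.lower(), RC=cap(nxt), RT=most_frequent_tag(nxt))
        return attrs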
We note in passing that the single attribute T is a 
good predictor of the correct label; using T as the 
predicted label gives a 7.7% error rate (see Table 1). 
Experiment 1. Because of memory limitations, 
we could not apply AdaBoost.MH to the entire 
training sample. We examined several approxima- 
tions. The simplest approximation (experiment 1) 
is to run AdaBoost.MH on 400K training examples, 
Exp. 1   400K training             3.68 ± .08
Exp. 2   4 × 300K                  3.32 ± .08
Exp. 3   Unambiguous & definite    3.59 ± .08
Exp. 4   AdaBoost.MI               3.28 ± .08

Table 3: Performance on experiments 1-4.
instead of the full training set. Doing so yields a test 
error of 3.68%, which is actually as good as using 
Markov 3-grams (Table 1). 
Experiment 2. In experiment 2, we divided the 
training data into four quarters, trained a classifier 
using AdaBoost.MH on each quarter, and combined 
the four classifiers using (loosely speaking) a final 
round of boosting. This improved test error signif- 
icantly, to 3.32%. In fact, this tagger performs as 
well as any single tagger in Table 1 except the Max- 
ent tagger. 
Experiment 3. In experiment 3, we reduced the
training sample by eliminating unambiguous words
(those with only one tag attested in training) and indefinite
tags. We examined all indefinite-tagged examples
tags. We examined all indefinite-tagged examples 
and made a forced choice among the alternatives. 
The result is not strictly comparable to results on 
the larger tagset, but since only 5 out of 54K test 
examples are affected, the difference is negligible. 
This yielded a multiclass problem with 648K exam- 
ples and 39 classes. We constructed a separate clas- 
sifier for unknown words, using AdaBoost.MH. We 
used hapax legomena (words appearing once) from 
our training sample to train it. The error rate on un- 
known words was 19.1%. The overall test error rate 
was 3.59%, intermediate between the error rates in 
the two previous experiments. 
Experiment 4. One obvious way of reducing the 
training data would be to train a separate classifier 
for each word. However, that approach would re- 
sult in extreme data fragmentation. An alternative 
is to cut the data in the other direction, and build a 
separate discriminator for each part of speech, and 
[Figure 2: Training and test error as a function of the
number of rounds of boosting for the PP-attachment
problem. The x-axis gives the number of rounds (10
to 10,000, log scale); the curves show training error,
test error, and the Collins & Brooks result.]
combine them by choosing the part of speech whose 
discriminator predicts 'Yes' with the most confi- 
dence (or 'No' with the least confidence). We took
this approach (algorithm AdaBoost.MI) in experiment
4. To choose the appropriate number of
rounds for each discriminator, we did an initial run, 
and chose the point at which error on the devel- 
opment sample flattened out. To handle unknown 
words, we used the same unknown-word classifier 
as in experiment 3. 
The result was the best for any of our experi- 
ments: a test error rate of 3.28%, slightly better than 
experiment 2. The 3.28% error rate is not signifi- 
cantly different (at p = 0.05) from the error rate of 
the best-known single tagger, Ratnaparkhi's Maxent 
tagger, which achieves 3.11% error on our data. 
Our results are not as good as those achieved by 
Brill and Wu's voting scheme. The experiments we 
describe here use very simple features, like those 
used in the Maxent or transformation-based taggers; 
hence the results are not comparable to the multiple- 
tagger voting scheme. We are optimistic that boost- 
ing would do well with tagger predictions as input 
features, but those experiments remain to be done. 
Table 2 breaks out the error sources for experi- 
ment 4. Table 3 sums up the results of all four ex- 
periments. 
Experiment 5 (Sequential model). To this point, 
tagging decisions are made based on local context 
only. One would expect performance to improve if
we use Viterbi-style optimization to choose
a globally best sequence of labels. Using decision
sequences also permits one to use true tags, rather
than most-frequent tags, on context tokens. We 
did a fixed 500 rounds of boosting, testing against 
the development sample. Surprisingly, the sequen- 
tial model performed much less well than the local- 
decision models. The results are summarized in Ta- 
ble 4. 
4 Prepositional phrase attachment 
In this section, we discuss the use of boosting for 
prepositional phrase (PP) attachment. The cases 
of PP-attachment that we address define a binary 
classification problem. For example, the sentence 
Congress accused the president of peccadillos is 
classified according to the attachment site of the 
prepositional phrase: 
attachment to N: accused [the president of peccadillos]
attachment to V: accused [the president] [of peccadillos]    (4)
The UPenn Treebank-II Parsed Wall Street Jour- 
nal corpus includes PP-attachment information, and 
PP-attachment classifiers based on this data have 
been previously described in Ratnaparkhi, Reynar,
and Roukos (1994), Brill and Resnik (1994), and Collins
and Brooks (1995). We consider how to apply 
boosting to this classification task. 
We used the same training and test data as Collins 
and Brooks (1995). The instances of PP-attachment 
considered are those involving a verb immediately 
followed by a simple noun phrase (the direct ob- 
ject) and a prepositional phrase (whose attachment 
is at issue). Each PP-attachment example is repre- 
sented by its value for four attributes: the main verb 
(V), the head word of the direct object (N1), the 
preposition (P), and the head word of the object 
of the preposition (N2). For instance, in example
(4) above, V = accused, N1 = president, P = of,
and N2 = peccadillos. Examples have binary la-
bels: positive represents attachment to noun, and 
negative represents attachment to verb. The train- 
ing set comprises 20,801 examples and the test set 
contains 3,097 examples; there is also a separate 
development set of 4,039 examples. 
The weak hypotheses we used correspond to "at- 
tribute=value" predicates and conjunctions thereof. 
That is, there are 16 predicates that are considered
for each example. For example (4), three of
these 16 predicates are (V = accused ∧ N1 =
president ∧ N2 = peccadillos), (P = of), and
(V = accused ∧ P = of). As described in section
2.1, a weak hypothesis produces one of two real-valued
predictions $p_0$, $p_1$, depending on the value of
                                              errors       percent
Local decisions, LT/RT = most-frequent tag    1489/52,087    3.18
Local decisions, LT/RT = true tag             1418/52,087    3.04
Sequential decisions                          2083/52,087    4.00

Table 4: Performance of the sequential model on the development sample.
Round   Test             Prediction
1       (P = of)           +2.393
2       (P = to)           -0.729
3       (N2 = NUMBER)      -0.772
4       (N1 = it)          -2.273
5       (P = at)           -0.669

Table 5: The first five weak hypotheses chosen for the PP-attachment classifier.
its predicate. We found that little information was 
conveyed by knowing that a predicate is false. We 
therefore forced each weak hypothesis to abstain if
its predicate is not satisfied; that is, we set $p_0$ to 0
for all weak hypotheses.
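As an illustration, the candidate predicates for one example can be enumerated as follows (a sketch of ours; we assume the count of 16 includes the empty, always-true conjunction, i.e. all $2^4$ subsets of the four attributes):

    from itertools import combinations

    ATTRS = ('V', 'N1', 'P', 'N2')

    def candidate_predicates(example):
        """All 16 conjunctions of attribute=value tests for one example,
        e.g. {'V': 'accused', 'N1': 'president', 'P': 'of', 'N2': 'peccadillos'}."""
        preds = []
        for r in range(len(ATTRS) + 1):
            for subset in combinations(ATTRS, r):
                preds.append(tuple((a, example[a]) for a in subset))
        return preds

    def weak_hyp(pred, p1):
        """Weak hypothesis that predicts p1 if pred holds, else abstains (0)."""
        return lambda x: p1 if all(x.get(a) == v for a, v in pred) else 0.0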
Two free parameters in boosting are the number
of rounds $T$ and the smoothing parameter $\epsilon$ for
the confidence values (see Eq. (3)). Although there
are theoretical analyses of the number of rounds
needed for boosting (Freund and Schapire, 1997;
Schapire et al., 1997) and of smoothing (Schapire
and Singer, 1998), these tend not to give practical
answers. We therefore used the development sample
to set these parameters, and chose $T = 20{,}000$
and $\epsilon = 0.001$.
On each round of boosting, we consider every 
predicate relevant to any example, and choose the 
one that minimizes Z as given by Eq. (2). In Ta- 
ble 5 we list the weak hypotheses chosen on the first 
five rounds of boosting, together with their assigned 
confidence Pl. Recall that a positive value means 
that noun attachment is predicted. Note that all the 
weak hypotheses chosen on the first rounds test the 
value of a single attribute: boosting starts with gen- 
eral tendencies and moves toward less widely ap- 
plicable but higher-precision tests as it proceeds. 
In 20,000 rounds of boosting, single-attribute tests
were chosen 4,615 times, two-attribute tests were
chosen 4,146 times, three-attribute tests were chosen
2,779 times, and four-attribute tests were chosen
8,460 times. It is possible for the same predicate
to be chosen in multiple rounds; in fact, predicates
were chosen about twice on average. The final
hypothesis considers 9,677 distinct predicates.
We can define the total weight of a predicate to be
the sum of the $p_1$ values over the rounds in which it is
chosen; this represents how big a vote the predicate has
on examples it applies to. We expect more-specific
hypotheses to have more weight; otherwise they
would not be able to overrule more-general hy- 
potheses, and there would be no point in having 
them. This is confirmed by examining the predi- 
cates with the greatest weight (in absolute value) af- 
ter 20,000 rounds of boosting, as shown in Table 6. 
After 20,000 rounds of boosting the test error 
was down to 14.6 ± 0.6%. This is indistinguishable
from the best known result for this problem,
namely 14.5 ± 0.6%, reported by Collins and Brooks
on exactly the same data. In Figure 2, we show the 
training and test error as a function of the number 
of rounds of boosting. The boosted classifier has 
the advantage of being much more compact than the 
large decision list built by Collins and Brooks using 
a back-off method. We also did not take into ac- 
count the linguistic knowledge used by Collins and 
Brooks who, for instance, disallowed tests that ig- 
nore the preposition. 
Although boosting and maximum entropy methods
(Ratnaparkhi et al., 1994) share a similar structure,
the boosted classifier achieves a significantly
lower error rate.
5 Using boosting to improve data quality 
The importance weights that boosting assigns to 
training examples are very useful for improving data 
quality. Mislabelled examples resulting from anno- 
tator errors tend to be hard examples to classify cor- 
rectly; hence they tend to have large weight in the 
Test                                                        Prediction
(V = was ∧ N1 = decision ∧ P = of ∧ N2 = People)              +25.41
(V = put ∧ N1 = them ∧ P = on ∧ N2 = streets)                 -23.08
(V = making ∧ N1 = it ∧ P = in ∧ N2 = terms)                  -22.89
(V = prompted ∧ N1 = speculation ∧ P = in ∧ N2 = market)      +25.76
(V = is ∧ N1 = director ∧ P = at ∧ N2 = Bank)                 +23.83

Table 6: The five weak hypotheses with the highest (absolute) weight after 20,000 rounds.
[Table 7: Training examples from experiment 4 with greatest weight, giving for each
token its previous word, the tagged word, its next word, the corpus label, and the
correct label.]
final distribution $D_{T+1}(i)$. If we rank examples by
their weight in the final distribution, mislabelled ex- 
amples tend to cluster near the top. 
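The diagnostic itself is trivial; a sketch (ours), assuming the boosting routine is extended to also return its final distribution:

    def suspect_examples(examples, labels, final_D, top_n=50):
        """Rank training examples by final weight D_{T+1}(i); the heaviest
        examples are candidates for annotation errors."""
        ranked = sorted(range(len(examples)), key=lambda i: -final_D[i])
        return [(examples[i], labels[i], final_D[i]) for i in ranked[:top_n]]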
Table 7 shows the training examples with the 
greatest weight in tagging experiment 4. All but 
two represent annotator errors, and one of the two 
non-errors is a highly unusual construction ("a lot 
of have and have-not markets"). Table 8 similarly 
illustrates the highest-weight examples from the PP- 
attachment data. Many of these are errors, though 
others are genuinely difficult to judge. 
V            N1           P     N2          Label
rose         NUMBER       to    NUMBER      N
dropped      NUMBER       to    NUMBER      N
added        NUMBER       to    NUMBER      N
gained       NUMBER       to    NUMBER      N
gained       NUMBER       to    NUMBER      N
jumped       NUMBER       to    NUMBER      N
reported     earnings     of    million     V
had          sales        of    million     V
lost         NUMBER       to    NUMBER      N
lost         NUMBER       to    NUMBER      N
lost         NUMBER       to    NUMBER      N
earned       million      on    revenue     N
outnumbered  NUMBER       to    NUMBER      V
had          change       in    earnings    V
had          change       in    earnings    V
posted       drop         in    profit      V
yielding     PERCENT      to    assumption  N
posted       loss         for   quarter     V
raise        billion      in    cash        V
is           reporter     in    bureau      V
yield        PERCENT      in    NUMBER      N
yield        PERCENT      in    NUMBER      N
have         impact       on    market      V
posted       drop         in    earnings    V
registered   NUMBER       on    scale       N
auction      million      in    maturity    V
following    decline      in    August      V
reported     earnings     for   quarter     V
signed       agreement    with  Inc.        V
have         impact       on    results     N
report       earnings     for   quarter     N
fell         NUMBER       to    point       N
buy          stake        in    Airlines    V
report       loss         for   quarter     N
make         payments     on    debt        V
took         charge       in    quarter     N
is           writer       in    York        V
earned       million      on    sales       N
earned       million      on    sales       N
reached      agreement    in    principle   V
reached      agreement    in    principle   V
started      venture      with  Co.         N
resolve      disputes     with  company     V
become       shareholder  in    bank        V
reach        agreement    with  regulators  V

Table 8: High-weight examples from the PP-attachment data. The last column gives
the label that appears in the corpus.

References

Eric Brill and Jun Wu. 1998. Classifier combination for 
improved lexical disambiguation. In Proceedings of 
COLING-ACL.

Eric Brill. 1993. Transformation-Based Learning. 
Ph.D. thesis, Univ. of Pennsylvania. 

Michael Collins and James Brooks. 1995. Prepositional 
phrase attachment through a backed-off model. In 
Proceedings of the Third Workshop on Very Large 
Corpora. 

Yoav Freund and Robert E. Schapire. 1997. A decision- 
theoretic generalization of on-line learning and an ap- 
plication to boosting. Journal of Computer and Sys- 
tem Sciences, 55(1):119-139, August.

David Magerman. 1995. Statistical decision-tree models 
for parsing. In Proc. ACL-95. 

Mitchell Marcus, Beatrice Santorini, and Mary Ann 
Marcinkiewicz. 1993. Building a large annotated cor- 
pus of English: The Penn Treebank. Computational 
Linguistics, 19(2):313-330. 

A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A max-
imum entropy model for prepositional phrase attach-
ment. In Proceedings of the ARPA Workshop on Hu-
man Language Technology.

Adwait Ratnaparkhi. 1996. A maximum entropy part- 
of-speech tagger. In Proceedings of the Empirical 
Methods in Natural Language Processing Conference. 

Robert E. Schapire and Yoram Singer. 1998. Improved 
boosting algorithms using confidence-rated predic- 
tions. In Proceedings of the Eleventh Annual Confer- 
ence on Computational Learning Theory, pages 80-91. 

Robert E. Schapire, Yoav Freund, Peter Bartlett, and 
Wee Sun Lee. 1997. Boosting the margin: A new 
explanation for the effectiveness of voting methods. 
In Machine Learning: Proceedings of the Fourteenth 
International Conference. 

Robert E. Schapire. 1997. Using output codes to boost 
multiclass learning problems. In Machine Learning: 
Proceedings of the Fourteenth International Confer- 
ence. 

Ralph Weischedel et al. 1993. Coping with ambigu- 
ity and unknown words through probabilistic models. 
Computational Linguistics, 19(2):359-382. 

E. Brill and P. Resnik. 1994. A rule-based approach to
prepositional phrase attachment disambiguation. In
Proceedings of the Fifteenth International Conference
on Computational Linguistics (COLING).
