Exploiting Diversity in Natural Language Processing: 
Combining Parsers 
John C. Henderson and Eric Brill 
Department of Computer Science 
Johns Hopkins University 
Baltimore, MD 21218 
{jhndrsn,brill}@cs.jhu.edu 
Abstract 
Three state-of-the-art statistical parsers are com- 
bined to produce more accurate parses, as well as 
new bounds on achievable Treebank parsing accu- 
racy. Two general approaches are presented and two 
combination techniques are described for each approach.
Both parametric and non-parametric models
are explored. The resulting parsers surpass the
best previously published performance results for 
the Penn Treebank. 
1 Introduction 
The natural language processing community is in the 
strong position of having many available approaches 
to solving some of its most fundamental problems. 
The machine learning community has been in a simi- 
lar situation and has studied the combination of mul- 
tiple classifiers (Wolpert, 1992; Heath et al., 1996). 
Their theoretical finding is simply stated: classification
error rate decreases toward the noise rate exponentially
in the number of independent, accurate
classifiers. The theory has also been validated em- 
pirically. 
Recently, combination techniques have been in- 
vestigated for part of speech tagging with positive 
results (van Halteren et al., 1998; Brill and Wu, 
1998). In both cases the investigators were able to 
achieve significant improvements over the previous 
best tagging results. Similar advances have been 
made in machine translation (Frederking and Niren- 
burg, 1994), speech recognition (Fiscus, 1997) and 
named entity recognition (Borthwick et al., 1998). 
The corpus-based statistical parsing community 
has many fast and accurate automated parsing sys- 
tems, including systems produced by Collins (1997), 
Charniak (1997) and Ratnaparkhi (1997). These 
three parsers have given the best reported parsing 
results on the Penn Treebank Wall Street Journal 
corpus (Marcus et al., 1993). We used these three 
parsers to explore parser combination techniques. 
2 Techniques for Combining Parsers 
2.1 Parse Hybridization 
We are interested in combining the substructures of 
the input parses to produce a better parse. We 
call this approach parse hybridization. The sub- 
structures that are unanimously hypothesized by the 
parsers should be preserved after combination, and 
the combination technique should not foolishly cre- 
ate substructures for which there is no supporting 
evidence. These two principles guide experimenta- 
tion in this framework, and together with the evalu- 
ation measures help us decide which specific type of 
substructure to combine. 
The precision and recall measures (described in 
more detail in Section 3) used in evaluating Tree- 
bank parsing treat each constituent as a separate 
entity, a minimal unit of correctness. Since our goal 
is to perform well under these measures we will simi- 
larly treat constituents as the minimal substructures 
for combination. 
2.1.1 Constituent Voting 
One hybridization strategy is to let the parsers vote 
on constituents' membership in the hypothesized 
set. If enough parsers suggest that a particular con- 
stituent belongs in the parse, we include it. We call 
this technique constituent voting. We include a con- 
stituent in our hypothesized parse if it appears in the 
output of a majority of the parsers. In our particular 
case the majority requires the agreement of only two 
parsers because we have only three. This technique 
has the advantage of requiring no training, but it 
has the disadvantage of treating all parsers equally 
even though they may have differing accuracies or 
may specialize in modeling different phenomena. 
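The voting rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `constituent_voting`, the `min_votes` parameter, and the three toy parses are assumptions made for the example, and constituents are written as (label, start, end) tuples, the representation used for evaluation in Section 3.

```python
from collections import Counter

def constituent_voting(parses, min_votes=None):
    """Return the constituents proposed by at least min_votes parsers.

    parses: list of sets of (label, start, end) tuples, one set per parser.
    min_votes defaults to a strict majority of the parsers.
    """
    if min_votes is None:
        min_votes = len(parses) // 2 + 1
    votes = Counter(c for parse in parses for c in parse)
    return {c for c, n in votes.items() if n >= min_votes}

# Hypothetical outputs of three parsers for one five-token sentence.
p1 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
p2 = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
p3 = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
hybrid = constituent_voting([p1, p2, p3])
```

With three parsers the default threshold is two votes, so `hybrid` keeps S, the NP spanning tokens 0 to 2, and the VP spanning tokens 2 to 5, and drops the two constituents proposed by only one parser.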
2.1.2 Naive Bayes 
Another technique for parse hybridization is to use 
a naive Bayes classifier to determine which con- 
stituents to include in the parse. The development of 
a naive Bayes classifier involves learning how much 
each parser should be trusted for the decisions it 
makes. Our original hope in combining these parsers 
is that their errors are independently distributed. 
This is equivalent to the assumption used in proba- 
bility estimation for naive Bayes classifiers, namely 
that the attribute values are conditionally indepen- 
dent when the target value is given. For this reason, 
naive Bayes classifiers are well-matched to this prob- 
lem. 
In Equations 1 through 3 we develop the model 
for constructing our parse using naive Bayes classification.
$C$ is the union of the sets of constituents
suggested by the parsers. $\pi(c)$ is a binary function
returning $t$ (for true) precisely when the constituent
$c \in C$ should be included in the hypothesis. $M_i(c)$
is a binary function returning $t$ when parser $i$ (from
among the $k$ parsers) suggests constituent $c$ should
be in the parse. The hypothesized parse is then the
set of constituents that are likely ($P > 0.5$) to be in
the parse according to this model.
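\begin{align}
\operatorname*{argmax}_{\pi(c)} P(\pi(c) \mid M_1(c) \ldots M_k(c))
&= \operatorname*{argmax}_{\pi(c)} \frac{P(M_1(c) \ldots M_k(c) \mid \pi(c))\, P(\pi(c))}{P(M_1(c) \ldots M_k(c))} \tag{1} \\
&= \operatorname*{argmax}_{\pi(c)} P(\pi(c)) \prod_{i=1}^{k} \frac{P(M_i(c) \mid \pi(c))}{P(M_i(c))} \tag{2} \\
&= \operatorname*{argmax}_{\pi(c)} P(\pi(c)) \prod_{i=1}^{k} P(M_i(c) \mid \pi(c)) \tag{3}
\end{align}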
The estimation of the probabilities in the model is 
carried out as shown in Equation 4. Here N(.) 
counts the number of hypothesized constituents in 
the development set that match the binary predi- 
cate specified as an argument. 
\[
P(\pi(c) = t) \prod_{i=1}^{k} P(M_i(c) \mid \pi(c) = t)
= \frac{N(\pi(c) = t)}{|C|} \prod_{i=1}^{k} \frac{N(M_i(c), \pi(c) = t)}{N(\pi(c) = t)} \tag{4}
\]
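As a rough sketch of how the counts in Equation 4 could be turned into a classifier, the Python below estimates the prior and the per-parser likelihoods by relative frequency and then normalizes the two class scores to obtain a posterior. The names `train` and `posterior_true`, the toy development data, and the absence of smoothing are all assumptions made for illustration; the paper does not describe these details.

```python
from collections import Counter

def train(examples):
    """Estimate naive Bayes parameters by relative frequency.

    examples: list of (votes, is_correct) pairs, one per candidate constituent
    in the development set; votes is a tuple of booleans, one per parser.
    """
    n_class = Counter(is_correct for _, is_correct in examples)
    n_joint = Counter()
    for votes, is_correct in examples:
        for i, vote in enumerate(votes):
            n_joint[(i, vote, is_correct)] += 1
    prior = {y: n_class[y] / len(examples) for y in (True, False)}

    def likelihood(i, vote, y):
        # P(M_i(c) = vote | pi(c) = y); unsmoothed, an assumption of this sketch.
        return n_joint[(i, vote, y)] / n_class[y] if n_class[y] else 0.0

    return prior, likelihood

def posterior_true(votes, prior, likelihood):
    """Posterior probability that a constituent with these votes is correct."""
    score = {}
    for y in (True, False):
        s = prior[y]
        for i, vote in enumerate(votes):
            s *= likelihood(i, vote, y)
        score[y] = s
    total = score[True] + score[False]
    return score[True] / total if total else 0.0

# Hypothetical development data for three parsers.
T, F = True, False
examples = (
    [((T, T, T), True)] * 6 + [((T, T, F), True)] * 2 + [((F, F, T), True)]
    + [((T, F, F), False)] * 3 + [((F, T, F), False)] * 2 + [((F, F, T), False)]
)
prior, likelihood = train(examples)
p_all = posterior_true((T, T, T), prior, likelihood)
p_one = posterior_true((T, F, F), prior, likelihood)
```

A constituent is included when its posterior exceeds 0.5, which on this toy data accepts the constituent proposed by all three parsers and rejects the one proposed only by the first.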
2.1.3 Lemma: No Crossing Brackets 
Under certain conditions the constituent voting and 
naive Bayes constituent combination techniques are 
guaranteed to produce sets of constituents with no 
crossing brackets. There are simply not enough 
votes remaining to allow any of the crossing struc- 
tures to enter the hypothesized constituent set. 
Lemma: If the number of votes required by con- 
stituent voting is greater than half of the parsers 
under consideration the resulting structure has no 
crossing constituents. 
Proof: Assume a pair of crossing constituents ap- 
pears in the output of the constituent voting tech- 
nique using k parsers. Call the crossing constituents 
A and B. A receives a votes, and B receives b votes. 
Each of the constituents must have received at least
$\lceil (k+1)/2 \rceil$ votes from the $k$ parsers, so $a \geq \lceil (k+1)/2 \rceil$ and
$b \geq \lceil (k+1)/2 \rceil$. Let $s = a + b$. None of the parsers produces
parses with crossing brackets, so none of them
votes for both of the assumed constituents. Hence,
$s \leq k$. But by addition of the votes on the two
constituents, $s \geq 2 \lceil (k+1)/2 \rceil > k$, a contradiction. $\blacksquare$
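As a concrete instance of the lemma, consider the three-parser setting used in this paper, where a majority requires two votes. The arithmetic below simply substitutes $k = 3$ into the proof; it adds no new assumptions.

\[
k = 3: \quad \left\lceil \tfrac{k+1}{2} \right\rceil = 2, \qquad a \geq 2,\; b \geq 2 \;\Rightarrow\; s = a + b \geq 4 > 3 = k .
\]

Since no single parser votes for both members of a crossing pair, at most three votes are available to the pair, so two crossing constituents cannot both reach the two-vote threshold.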
Similarly, when the naive Bayes classifier is con- 
figured such that the constituents require estimated 
probabilities strictly larger than 0.5 to be accepted, 
there is not enough probability mass remaining on 
crossing brackets for them to be included in the hy- 
pothesis. 
2.2 Parser Switching 
In general, the lemma of the previous section does 
not ensure that all the productions in the combined 
parse are found in the grammars of the member 
parsers. There is a guarantee of no crossing brackets 
but there is no guarantee that a constituent in the 
tree has the same children as it had in any of the 
three original parses. One can trivially create sit- 
uations in which strictly binary-branching trees are 
combined to create a tree with only the root node 
and the terminal nodes, a completely flat structure. 
This drastic tree manipulation is not appropriate 
for situations in which we want to assign particu- 
lar structures to sentences. For example, we may 
have semantic information (e.g. database query op- 
erations) associated with the productions in a gram- 
mar. If the parse contains productions from outside 
our grammar the machine has no direct method for 
handling them (e.g. the resulting database query 
may be syntactically malformed). 
We have developed a general approach for combin- 
ing parsers when preserving the entire structure of 
a parse tree is important. The combining algorithm 
is presented with the candidate parses and asked to 
choose which one is best. The combining technique 
must act as a multi-position switch indicating which 
parser should be trusted for the particular sentence. 
We call this approach parser switching. Once again 
we present both a non-parametric and a parametric 
technique for this task. 
2.2.1 Similarity Switching 
First we present the non-parametric version of parser 
switching, similarity switching: 
1. From each candidate parse, $\pi_i$, for a sentence,
create the constituent set $S_i$ by converting each
constituent into its tuple representation.
2. The score for $\pi_i$ is $\sum_{j \neq i} |S_j \cap S_i|$, where $j$ ranges
over the candidate parses for the sentence.
3. Switch to (use) the parser with the highest score 
for the sentence. Ties are broken arbitrarily. 
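The three steps above can be sketched as follows. The function name `similarity_switch` and the toy parses are assumptions for illustration, and ties are broken here by taking the lowest parser index, which is one arbitrary choice among many.

```python
def similarity_switch(parses):
    """Pick the parse that shares the most constituents with the other parses.

    parses: list of sets of (label, start, end) tuples, one set per parser.
    Returns (index of the chosen parse, list of scores for all parses).
    """
    scores = [
        sum(len(s_i & s_j) for j, s_j in enumerate(parses) if j != i)
        for i, s_i in enumerate(parses)
    ]
    # max() returns the first maximal index, so ties go to the earlier parser.
    best = max(range(len(parses)), key=scores.__getitem__)
    return best, scores

# Hypothetical outputs of three parsers for one five-token sentence.
p1 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
p2 = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
p3 = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
best, scores = similarity_switch([p1, p2, p3])
```

In this example the first parse shares two constituents with each of the others, giving it the highest score, so the switch selects it unchanged.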
The intuition for this technique is that we can 
measure a similarity between parses by counting the 
constituents they have in common. We pick the 
parse that is most similar to the other parses by 
choosing the one with the highest sum of pairwise 
similarities. This is the parse that is closest to the 
centroid of the observed parses under the similarity
metric. 
2.2.2 Naive Bayes
The probabilistic version of this procedure is 
straightforward: We once again assume indepen- 
dence among our various member parsers. Further- 
more, we know one of the original parses will be the 
hypothesized parse, so the direct method of determining
which one is best is to compute the probability
of each of the candidate parses using the probabilistic
model we developed in Section 2.1. We
model each parse as the decisions made to create 
it, and model those decisions as independent events. 
Each decision determines the inclusion or exclusion 
of a candidate constituent. The set of candidate 
constituents comes from the union of all the con- 
stituents suggested by the member parsers. This 
is summarized in Equation 5. The computation of
$P(\pi_i(c) \mid M_1(c) \ldots M_k(c))$ has been sketched before in
Equations 1 through 4. In this case we are interested
in finding the maximum probability parse, $\pi_i$,
and $M_i$ is the set of relevant (binary) parsing decisions
made by parser $i$. $\pi_i$ is a parse selected from
among the outputs of the individual parsers. It is 
chosen such that the decisions it made in including 
or excluding constituents are most probable under 
the models for all of the parsers. 
\[
\operatorname*{argmax}_{\pi_i} P(\pi_i \mid M_1 \ldots M_k)
= \operatorname*{argmax}_{\pi_i} \prod_{c} P(\pi_i(c) \mid M_1(c) \ldots M_k(c)) \tag{5}
\]
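A minimal sketch of the selection rule in Equation 5, assuming the per-constituent probabilities have already been estimated by the naive Bayes model of Section 2.1. The function name `bayes_switch`, the dictionary `p_include`, its values, and the clamping constant are hypothetical; log probabilities are summed instead of multiplying probabilities, which leaves the argmax unchanged.

```python
import math

def bayes_switch(parses, p_include):
    """Pick the parse whose include/exclude decisions are most probable.

    parses: list of sets of (label, start, end) tuples, one set per parser.
    p_include: maps each candidate constituent to an estimated probability
    that it belongs in the parse (assumed to come from a separate model).
    Returns (index of the chosen parse, list of log scores).
    """
    candidates = set().union(*parses)

    def log_score(parse):
        total = 0.0
        for c in candidates:
            # Clamp to avoid log(0); the constant is an arbitrary choice.
            p = min(max(p_include[c], 1e-9), 1.0 - 1e-9)
            total += math.log(p if c in parse else 1.0 - p)
        return total

    scores = [log_score(parse) for parse in parses]
    best = max(range(len(parses)), key=scores.__getitem__)
    return best, scores

# Hypothetical parses and probabilities for one five-token sentence.
p1 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
p2 = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
p3 = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
p_include = {
    ("S", 0, 5): 0.99, ("NP", 0, 2): 0.9, ("VP", 2, 5): 0.8,
    ("VP", 3, 5): 0.2, ("NP", 0, 1): 0.1,
}
best, scores = bayes_switch([p1, p2, p3], p_include)
```

Every candidate constituent contributes to every parse's score, either through its inclusion or through its exclusion, which is why a parse is penalized for omitting a likely constituent as well as for proposing an unlikely one.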
3 Experiments
The three parsers were trained and tuned by their 
creators on various sections of the WSJ portion of 
the Penn Treebank, leaving only sections 22 and 23 
completely untouched during the development of any 
of the parsers. We used section 23 as the development
set for our combining techniques, and section
22 only for final testing. The development set contained
44088 constituents in 2416 sentences and the
test set contained 30691 constituents in 1699 sentences.
A sentence was withheld from section 22
because its extreme length was troublesome for a
couple of the parsers.¹
¹ The sentence in question was more than 100 words in
length and included nested quotes and parenthetical expressions.
The standard measures for evaluating Penn Treebank
parsing performance are precision and recall of
the predicted constituents. Each parse is converted
into a set of constituents represented as tuples:
(label, start, end). The set is then compared with
the set generated from the Penn Treebank parse to 
determine the precision and recall. Precision is the 
portion of hypothesized constituents that are cor- 
rect and recall is the portion of the Treebank con- 
stituents that are hypothesized. 
For our experiments we also report the mean of 
precision and recall, which we denote by (P + R)/2,
and the F-measure. The F-measure is the harmonic mean of
precision and recall, 2PR/(P + R). It is closer to 
the smaller value of precision and recall when there 
is a large skew in their values. 
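The measures described above can be computed directly from two sets of (label, start, end) tuples. The sketch below is illustrative only: the function name `evaluate` and the toy hypothesis and gold sets are assumptions, and the handling of empty sets is a choice made for the example.

```python
def evaluate(hypothesis, gold):
    """Return (precision, recall, mean, F-measure) for two constituent sets."""
    correct = len(hypothesis & gold)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(gold) if gold else 0.0
    mean = (precision + recall) / 2
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, mean, f_measure

# Hypothetical Treebank constituents and a hypothesized parse.
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
hypothesis = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
precision, recall, mean, f_measure = evaluate(hypothesis, gold)
```

Here two of the three hypothesized constituents are correct and two of the four gold constituents are found, so precision is 2/3 and recall is 1/2; the F-measure (4/7) falls below the arithmetic mean (7/12), illustrating how it leans toward the smaller of the two values.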
We performed three experiments to evaluate our 
techniques. The first shows how constituent features 
and context do not help in deciding which parser 
to trust. We then show that the combining tech- 
niques presented above give better parsing accuracy 
than any of the individual parsers. Finally we show 
the combining techniques degrade very little when a 
poor parser is added to the set. 
3.1 Context 
It is possible one could produce better models by in- 
troducing features describing constituents and their 
contexts because one parser could be much better 
than the majority of the others in particular situa- 
tions. For example, one parser could be more ac- 
curate at predicting noun phrases than the other 
parsers. None of the models we have presented uti- 
lize features associated with a particular constituent 
(i.e. the label, span, parent label, etc.) to influence 
parser preference. This is not an oversight. Fea- 
tures and context were initially introduced into the 
models, but they refused to offer any gains in per- 
formance. While we cannot prove there are no such 
useful features on which one should condition trust, 
we can give some insight into why the features we 
explored offered no gain. 
Because we are working with only three parsers, 
the only situation in which context will help us is 
when it can indicate we should choose to believe a 
single parser that disagrees with the majority hy- 
pothesis instead of the majority hypothesis itself. 
This is the only important case, because otherwise 
the simple majority combining technique would pick 
the correct constituent. One side of the decision 
making process is when we choose to believe a con- 
stituent should be in the parse, even though only 
one parser suggests it. We call such a constituent an 
isolated constituent. If we were working with more 
than three parsers we could investigate minority con- 
stituents, those constituents that are suggested by 
at least one parser, but which the majority of the 
parsers do not suggest. 
Adding the isolated constituents to our hypothe- 
sis parse could increase our expected recall, but in 
the cases we investigated it would invariably hurt 
our precision more than we would gain on recall. 
Consider for a set of constituents the isolated con- 
stituent precision parser metric, the portion of iso- 
lated constituents that are correctly hypothesized. 
When this metric is less than 0.5, we expect to incur
more errors² than we will remove by adding those
constituents to the parse. 
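The reasoning behind the 0.5 threshold is that adding n isolated constituents with precision p removes about pn recall errors but introduces about (1 - p)n precision errors, so total errors grow whenever p < 0.5. The metric itself can be sketched as below; the function name `isolated_constituent_precision` and the toy data are assumptions for illustration, and a parser with no isolated constituents is reported with precision `None`.

```python
from collections import Counter

def isolated_constituent_precision(parses, gold):
    """Per-parser (count, precision) over constituents only that parser proposed.

    parses: list of sets of (label, start, end) tuples, one set per parser.
    gold: set of Treebank constituents for the same sentences.
    """
    votes = Counter(c for parse in parses for c in parse)
    results = []
    for parse in parses:
        isolated = {c for c in parse if votes[c] == 1}
        n = len(isolated)
        results.append((n, len(isolated & gold) / n if n else None))
    return results

# Hypothetical outputs of three parsers and a hypothetical gold parse.
p1 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
p2 = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
p3 = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
gold = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
result = isolated_constituent_precision([p1, p2, p3], gold)
```

Partitioning the candidate constituents by label, sentence length, or span length before calling such a function would give per-partition figures of the kind reported in Table 1 and Figures 1 and 2.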
We show the results of three of the experiments we 
conducted to measure isolated constituent precision 
under various partitioning schemes. In Table 1 we 
see with very few exceptions that the isolated con- 
stituent precision is less than 0.5 when we use the 
constituent label as a feature. The counts represent 
portions of the approximately 44000 constituents hy- 
pothesized by the parsers in the development set. 
In the cases where isolated constituent precision is 
larger than 0.5 the affected portion of the hypotheses 
is negligible. 
Similarly Figures 1 and 2 show how the iso- 
lated constituent precision varies by sentence length 
and the size of the span of the hypothesized con- 
stituent. In each figure the upper graph shows the 
isolated constituent precision and the bottom graph 
shows the corresponding number of hypothesized 
constituents. Again we notice that the isolated con- 
stituent precision is larger than 0.5 only in those 
partitions that contain very few samples. From this 
we see that a finer-grained model for parser combi- 
nation, at least for the features we have examined, 
will not give us any additional power. 
3.2 Performance Testing 
The results in Table 2 were achieved on the develop- 
ment set. The first two rows of the table are base- 
lines. The first row represents the average accuracy 
of the three parsers we combine. The second row 
is the accuracy of the best of the three parsers.³
The next two rows are results of oracle experiments. 
The parser switching oracle is the upper bound on 
the accuracy that can be achieved on this set in the 
parser switching framework. It is the performance 
we could achieve if an omniscient observer told us 
which parser to pick for each of the sentences. The 
maximum precision row is the upper bound on accu- 
racy if we could pick exactly the correct constituents 
from among the constituents suggested by the three 
parsers. Another way to interpret this is that less 
than 5% of the correct constituents are missing from 
the hypotheses generated by the union of the three 
parsers. The maximum precision oracle is an upper 
bound on the possible gain we can achieve by parse 
hybridization. 
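The two oracles can be sketched as follows, given access to the Treebank constituents. The function names and toy data are assumptions, and in particular the criterion used by `switching_oracle` (most correct constituents, ties going to the smaller parse) is a guess at what "best" means per sentence; the text does not specify it.

```python
def max_precision_oracle(parses, gold):
    """Keep exactly the correct constituents among those any parser proposed."""
    return set().union(*parses) & gold

def recall(hypothesis, gold):
    """Portion of the gold constituents that are hypothesized."""
    return len(hypothesis & gold) / len(gold) if gold else 0.0

def switching_oracle(parses, gold):
    """Index of the parse an omniscient observer might pick for a sentence.

    Assumption of this sketch: the best parse is the one with the most
    correct constituents, with ties going to the parse with fewer constituents.
    """
    return max(range(len(parses)),
               key=lambda i: (len(parses[i] & gold), -len(parses[i])))

# Hypothetical outputs of three parsers and a hypothetical gold parse.
p1 = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
p2 = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)}
p3 = {("S", 0, 5), ("NP", 0, 1), ("VP", 2, 5)}
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
oracle_parse = max_precision_oracle([p1, p2, p3], gold)
oracle_recall = recall(oracle_parse, gold)
best_index = switching_oracle([p1, p2, p3], gold)
```

The maximum precision oracle is perfectly precise by construction, so its recall alone measures how many correct constituents are absent from the union of the parsers' outputs; in the toy example one gold constituent in four is proposed by no parser.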
² This is in absolute terms, total errors being the sum of
precision errors and recall errors.
³ The identity of this parser is not given, nor is the identity
disclosed for the results of any of the individual parsers.
We do not aim to compare the performance of the individual
parsers, nor do we want to bias further research by giving the
individual parser results for the test set.
Parser      Sentences     %
Parser 1          279    16
Parser 2          216    13
Parser 3         1204    71
Table 4: Bayes Switching Parser Usage
We do not show the numbers for the Bayes models 
in Table 2 because the parameters involved were es- 
tablished using this set. The precision and recall of 
similarity switching and constituent voting are both 
significantly better than the best individual parser, 
and constituent voting is significantly better than 
parser switching in precision.⁴ Constituent voting
gives the highest accuracy for parsing the Penn Tree- 
bank reported to date. 
Table 3 contains the results for evaluating our sys- 
tems on the test set (section 22). All of these systems 
were run on data that was not seen during their de- 
velopment. The difference in precision between sim- 
ilarity and Bayes switching techniques is significant, 
but the difference in recall is not. This is the first 
set that gives us a fair evaluation of the Bayes mod- 
els, and the Bayes switching model performs signif- 
icantly better than its non-parametric counterpart. 
The constituent voting and naive Bayes techniques 
are equivalent because the parameters learned in the 
training set did not sufficiently discriminate between 
the three parsers. 
Table 4 shows how much the Bayes switching tech- 
nique uses each of the parsers on the test set. Parser 
3, the most accurate parser, was chosen 71% of the 
time, and Parser 1, the least accurate parser, was chosen
16% of the time. Ties are rare in Bayes switching
because the models are fine-grained: many estimated
probabilities are involved in each decision.
3.3 Robustness Testing 
In the interest of testing the robustness of these com- 
bining techniques, we added a fourth, simple non- 
lexicalized PCFG parser. The PCFG was trained 
from the same sections of the Penn Treebank as the 
other three parsers. It was then tested on section 
22 of the Treebank in conjunction with the other 
parsers. 
The results of this experiment can be seen in Table 5.
The entries in this table can be compared with
those of Table 3 to see how the performance of the
combining techniques degrades in the presence of an
inferior parser. As seen by the drop in average individual
parser performance baseline, the introduced
parser does not perform very well. The average individual
parser accuracy was reduced by more than
5% when we added this new parser, but the precision
of the constituent voting technique was the only
result that decreased significantly. The Bayes models
were able to achieve significantly higher precision
than their non-parametric counterparts. We see
from these results that the behavior of the parametric
techniques is robust in the presence of a poor
parser. Surprisingly, the non-parametric switching
technique also exhibited robust behavior in this situation.
⁴ All significance claims are made based on a binomial hypothesis
test of equality with an α < 0.01 confidence level.
Constituent          Parser 1           Parser 2           Parser 3
Label        Count  Precision   Count  Precision   Count  Precision
ADJP           132      28.78     215      21.86     173      34.10
ADVP           150      25.33     129      21.70     102      31.37
CONJP            2      50.00       8      37.50       3       0.00
FRAG            51       3.92      29      27.58      11       9.09
INTJ             3      66.66       1     100.00       2      50.00
LST              0         NA       0         NA       0         NA
NAC              0         NA      13      53.84       7      14.28
NP            1489      21.08    1550      18.38    1178      27.33
NX               7      85.71       9      22.22       3       0.00
PP             732      23.63     643      20.06     503      27.83
PRN             20      55.00      33      54.54      38      15.78
PRT             12      16.66      20      40.00      16      37.50
QP              21      38.09      34      44.11      76      14.47
RRC              1       0.00       1       0.00       2       0.00
S              757      13.73     482      23.65     434      38.94
SBAR           331      11.78     196      23.97     178      34.83
SBARQ            0         NA       6      16.66       3       0.00
SINV             3      66.66      11      81.81      13      30.76
SQ               2       0.00      11      18.18       3      33.33
UCP              6      16.66      12       8.33       8      12.50
VP             868      13.36     630      24.12     477      35.42
WHADJP           0         NA       0         NA       1       0.00
WHADVP           2     100.00       5      40.00       1     100.00
WHNP            33      33.33       8      25.00      17      58.82
WHPP             0         NA       0         NA       2     100.00
X                0         NA       2     100.00       1       0.00
Table 1: Isolated Constituent Precision By Constituent Label
Reference / System                 P       R   (P+R)/2       F
Average Individual Parser      87.14   86.91     87.02   87.02
Best Individual Parser         88.73   88.54     88.63   88.63
Parser Switching Oracle        93.12   92.84     92.98   92.98
Maximum Precision Oracle      100.00   95.41     97.70   97.65
Similarity Switching           89.50   89.88     89.69   89.69
Constituent Voting             92.09   89.18     90.64   90.61
Table 2: Summary of Development Set Performance
Reference / System                 P       R   (P+R)/2       F
Average Individual Parser      87.61   87.83     87.72   87.72
Best Individual Parser         89.61   89.73     89.67   89.67
Parser Switching Oracle        93.78   93.87     93.82   93.82
Maximum Precision Oracle      100.00   95.91     97.95   97.91
Similarity Switching           90.04   90.81     90.43   90.43
Bayes Switching                90.78   90.70     90.74   90.74
Constituent Voting             92.42   90.10     91.26   91.25
Naive Bayes                    92.42   90.10     91.26   91.25
Table 3: Test Set Results
[Figure 1 plots, for Parser 1, Parser 2, and Parser 3: (upper graph) isolated constituent precision against sentence length; (lower graph) the corresponding number of hypothesized constituents.]
Figure 1: Isolated Constituent Precision and Sentence Length
4 Conclusion 
We have presented two general approaches to studying
parser combination: parser switching and parse
hybridization. For each approach we gave a non-parametric
and a parametric technique for combining
parsers. All four of the techniques studied result
in parsing systems that perform better than any previously
reported. Both of the switching techniques,
as well as the parametric hybridization technique,
were also shown to be robust when a poor parser was
introduced into the experiments. Through parser
combination we have reduced the precision error rate
by 30% and the recall error rate by 6% compared to
the best previously published result.
Combining multiple highly-accurate independent
parsers yields promising results. We plan to explore
more powerful techniques for exploiting the diversity
of parsing methods.
[Figure 2 plots, for Parser 1, Parser 2, and Parser 3: (upper graph) isolated constituent precision against constituent span length in tokens; (lower graph) the corresponding number of hypothesized constituents.]
Figure 2: Isolated Constituent Precision and Span Length
Reference / System                 P       R   (P+R)/2       F
Average Individual Parser      84.55   80.91     82.73   82.69
Best Individual Parser         89.61   89.73     89.67   89.67
Parser Switching Oracle        93.92   93.88     93.90   93.90
Maximum Precision Oracle      100.00   96.66     98.33   98.30
Similarity Switching           89.90   90.89     90.40   90.39
Bayes Switching                90.94   90.70     90.82   90.82
Constituent Voting             89.78   91.80     90.79   90.78
Naive Bayes                    92.42   90.10     91.26   91.25
Table 5: Robustness Test Results
5 Acknowledgements 
We would like to thank Eugene Charniak, Michael 
Collins, and Adwait Ratnaparkhi for enabling all of 
this research by providing us with their parsers and 
helpful comments. 
This work was funded by NSF grant IRI-9502312. 
Both authors are members of the Center for Lan- 
guage and Speech Processing at Johns Hopkins Uni- 
versity. 

References 

Andrew Borthwick, John Sterling, Eugene 
Agichtein, and Ralph Grishman. 1998. Ex- 
ploiting diverse knowledge sources via maximum 
entropy in named entity recognition. In Eugene 
Charniak, editor, Proceedings of the Sixth Work- 
shop on Very Large Corpora, pages 152-160, 
Montreal. 

Eric Brill and Jun Wu. 1998. Classifier combination 
for improved lexical disambiguation. In Proceedings
of the 17th International Conference on Compu- 
tational Linguistics. 

Eugene Charniak. 1997. Statistical parsing with 
a context-free grammar and word statistics. In 
Proceedings of the Fourteenth National Confer- 
ence on Artificial Intelligence, Menlo Park. AAAI 
Press/MIT Press. 

Michael Collins. 1997. Three generative, lexicalised 
models for statistical parsing. In Proceedings of 
the Annual Meeting of the Association for Computational
Linguistics, volume 35, Madrid.

Jonathan G. Fiscus. 1997. A post-processing sys- 
tem to yield reduced word error rates: Recognizer 
output voting error reduction (ROVER). In Eu- 
roSpeech 1997 Proceedings, volume 4, pages 1895-1898. 

Robert Frederking and Sergei Nirenburg. 1994. 
Three heads are better than one. In Proceedings of 
the 4th Conference on Applied Natural Language 
Processing, pages 95-100, Stuttgart, Germany. 

David Heath, Simon Kasif, and Steven Salzberg. 
1996. Committees of decision trees. In 
B. Gorayska and J. Mey, editors, Cognitive 
Technology: In Search of a Humane Inter- 
face, pages 305-317. Elsevier Science B.V., 
Amsterdam. 

Mitchell P. Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building a large 
annotated corpus of English: The Penn Treebank.
Computational Linguistics, 19(2):313-330. 

Adwait Ratnaparkhi. 1997. A linear observed time 
statistical parser based on maximum entropy 
models. In Second Conference on Empirical Meth- 
ods in Natural Language Processing, Providence, 
R.I. 

Hans van Halteren, Jakub Zavrel, and Walter Daele- 
mans. 1998. Improving data driven wordclass tag- 
ging by system combination. In Proceedings of the 
17th International Conference on Computational 
Linguistics, Montreal. 

David H. Wolpert. 1992. Stacked generalization. 
Neural Networks, 5:241-259. 
