Corpus Variation and Parser Performance
Daniel Gildea
University of California, Berkeley, and
International Computer Science Institute
gildea@cs.berkeley.edu
Abstract
Most work in statistical parsing has focused on a
single corpus: the Wall Street Journal portion of the
Penn Treebank. While this has allowed for quantitative
comparison of parsing techniques, it has left
open the question of how other types of text might
affect parser performance, and how portable parsing
models are across corpora. We examine these
questions by comparing results for the Brown and
WSJ corpora, and also consider which parts of the
parser's probability model are particularly tuned to
the corpus on which it was trained. This leads us
to a technique for pruning parameters to reduce the
size of the parsing model.
1 Introduction
The past several years have seen great progress in
the field of natural language parsing, through the use
of statistical methods trained using large corpora of
hand-parsed training data. The techniques of Charniak
(1997), Collins (1997), and Ratnaparkhi (1997)
achieved roughly comparable results using the same
sets of training and test data. In each case, the corpus
used was the Penn Treebank's hand-annotated
parses of Wall Street Journal articles. Relatively
few quantitative parsing results have been reported
on other corpora (though see Stolcke et al. (1996)
for results on Switchboard, as well as Collins et
al. (1999) for results on Czech and Hwa (1999) for
bootstrapping from WSJ to ATIS). The inclusion of
parses for the Brown corpus in the Penn Treebank
allows us to compare parser performance across cor-
pora. In this paper we examine the following ques-
tions:
• To what extent is the performance of statistical
  parsers on the WSJ task due to its relatively
  uniform style, and how might such parsers fare
  on the more varied Brown corpus?
• Can training data from one corpus be applied
  to parsing another?
• What aspects of the parser's probability model
  are particularly tuned to one corpus, and which
  are more general?
Our investigation of these questions leads us to
a surprising result about parsing the WSJ corpus:
over a third of the model's parameters can be eliminated
with little impact on performance. Aside
from cross-corpus considerations, this is an important
finding if a lightweight parser is desired or memory
usage is a consideration.
2 Previous Comparisons of Corpora
A great deal of work has been done outside of the
parsing community analyzing the variations between
corpora and different genres of text. Biber (1993)
investigated variation in a number of syntactic features
over genres, or registers, of language. Of
particular importance to statistical parsers is the
investigation of frequencies for verb subcategorizations,
such as that of Roland and Jurafsky (1998). Roland
et al. (2000) find that subcategorization frequencies
for certain verbs vary significantly between the
Wall Street Journal corpus and the mixed-genre
Brown corpus, but that they vary less between
genre-balanced British and American corpora. Argument
structure is essentially the task that automatic
parsers attempt to solve, and the frequencies
of various structures in training data are reflected in
a statistical parser's probability model. The variation
in verb argument structure found by previous
research caused us to wonder to what extent a model
trained on one corpus would be useful in parsing another.
The probability models of modern parsers
include not only the number and syntactic type of
a word's arguments, but also lexical information about
their fillers. Although we are not aware of previous
comparisons of the frequencies of argument fillers,
we can only assume that they vary at least as much
as the syntactic subcategorization frames.
3 The Parsing Model
We take as our baseline parser Model 1 of
Collins (1997). The model is a history-based,
generative model, in which the probability for
a parse tree is found by expanding each node in the
tree in turn into its child nodes, and multiplying the
probabilities for each action in the derivation. It can
be thought of as a variety of lexicalized probabilistic
context-free grammar, with the rule probabilities
factored into three distributions. The first distribution
gives the probability of the syntactic category H
of the head child of a parent node with category
P, head word Hhw, and head tag (the part of
speech tag of the head word) Hht:

\[ P_h(H \mid P, Hht, Hhw) \]

The head word and head tag of the new node H are
defined to be the same as those of its parent. The
remaining two distributions generate the non-head
children one after the other. A special #STOP#
symbol is generated to terminate the sequence of
children for a given parent. Each child is generated
in two steps: first its syntactic category C and
head tag Cht are chosen given the parent's and head
child's features and a function $\Delta$ representing the
distance from the head child:

\[ P_c(C, Cht \mid P, H, Hht, Hhw, \Delta) \]

Then the new child's head word Chw is chosen:

\[ P_{cw}(Chw \mid P, H, Hht, Hhw, \Delta, C, Cht) \]

For each of the three distributions, the empirical distribution
of the training data is interpolated with
less specific backoff distributions, as we will see in
Section 5. Further details of the model, including
the distance features used and special handling of
punctuation, conjunctions, and base noun phrases,
are described in Collins (1999).
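
The factorization above can be illustrated with a minimal sketch. It assumes the three distributions are available as plain callables (`p_h`, `p_c`, `p_cw` are hypothetical names, not part of any published implementation), and that the caller supplies the #STOP# events explicitly in the list of children. It only shows how the probability of one node expansion is a product of the three kinds of factors; it is not the parser itself.

```python
from typing import Callable, Sequence, Tuple

STOP = "#STOP#"

# One non-head child event: (category C, head tag Cht, head word Chw, distance feature).
Child = Tuple[str, str, str, str]


def expansion_probability(
    parent: str,
    head: str,
    hht: str,
    hhw: str,
    children: Sequence[Child],
    p_h: Callable[..., float],
    p_c: Callable[..., float],
    p_cw: Callable[..., float],
) -> float:
    """Probability of expanding one parent node into its head and non-head children."""
    # First factor: choose the head child's category.
    prob = p_h(head, parent, hht, hhw)
    for cat, cht, chw, delta in children:
        # Second factor: choose the child's category and head tag.
        prob *= p_c(cat, cht, parent, head, hht, hhw, delta)
        # Third factor: choose the child's head word (a STOP event has no word).
        if cat != STOP:
            prob *= p_cw(chw, parent, head, hht, hhw, delta, cat, cht)
    return prob


def toy(*_args) -> float:
    """Toy distribution (assumption for illustration): every event has probability 0.5."""
    return 0.5


example = expansion_probability(
    "S", "VP", "VBD", "rose",
    [("NP", "NNS", "prices", "left"), (STOP, "", "", "left"), (STOP, "", "", "right")],
    toy, toy, toy,
)
```

In a full parser the probability of a tree would be the product of such terms over all of its nodes.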
The fundamental features used in the probability
distributions are the lexical heads and head
tags of each constituent, the co-occurrences of parent
nodes and their head children, and the co-occurrences
of child nodes with their head siblings
and parents. The probability models of Charniak
(1997), Magerman (1995) and Ratnaparkhi (1997)
differ in their details but are based on similar features.
Models 2 and 3 of Collins (1997) add some
slightly more elaborate features to the probability
model, as do the additions of Charniak (2000) to
the model of Charniak (1997).
Our implementation of Collins' Model 1 performs
at 86% precision and recall of labeled parse constituents
on the standard Wall Street Journal training
and test sets. While this does not reflect
the state-of-the-art performance on the WSJ task
achieved by the more complex models of Charniak
(2000) and Collins (2000), we regard it as a
reasonable baseline for the investigation of corpus
effects on statistical parsing.
4 Parsing Results on the Brown Corpus
We conducted separate experiments using WSJ
data, Brown data, and a combination of the two
as training material. For the WSJ data, we observed
the standard division into training (sections
2 through 21 of the treebank) and test (section 23)
sets. For the Brown data, we reserved every tenth
sentence in the corpus as test data, using the other
nine for training. This may underestimate the difficulty
of the Brown corpus by including sentences
from the same documents in training and test sets.
However, because of the variation within the Brown
corpus, we felt that a single contiguous test section
might not be representative. Only the subset of the
Brown corpus available in the Treebank II bracketing
format was used. This subset consists primarily
of various fiction genres. Corpus sizes are shown in
Table 1.

              Training Set            Test Set
Corpus    Sentences    Words     Sentences    Words
WSJ         39,832    950,028      2,245     48,665
Brown       21,818    413,198      2,282     38,109

Table 1: Corpus sizes. Both test sets were restricted
to sentences of 40 words or fewer. The Brown test
set's average sentence was shorter despite the length
restriction.
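
The Brown split described above (every tenth sentence held out for testing) can be sketched as follows. Which sentence in each group of ten is held out is not stated in the text, so taking the last one is an assumption made here for illustration; `split_every_tenth` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_every_tenth(sentences: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Hold out every tenth sentence as test data; the other nine go to training."""
    train: List[T] = []
    test: List[T] = []
    for index, sentence in enumerate(sentences):
        # Assumption: the tenth sentence of each block of ten (index 9, 19, ...) is held out.
        if index % 10 == 9:
            test.append(sentence)
        else:
            train.append(sentence)
    return train, test


train_ids, test_ids = split_every_tenth(list(range(20)))
```

Because the split interleaves sentences, training and test data come from the same documents, which is the source of the possible underestimate of difficulty noted in the text.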

Training Data   Test Set   Recall   Prec.
WSJ             WSJ         86.1    86.6
WSJ             Brown       80.3    81.0
Brown           Brown       83.6    84.6
WSJ+Brown       Brown       83.9    84.8
WSJ+Brown       WSJ         86.3    86.9

Table 2: Parsing results by training and test corpus

Results for the Brown corpus, along with WSJ
results for comparison, are shown in Table 2. The
basic mismatch between the two corpora is shown
in the significantly lower performance of the WSJ-trained
model on Brown data than on WSJ data
(rows 1 and 2). A model trained on Brown data only
does significantly better, despite the smaller size of
the training set. Combining the WSJ and Brown
training data in one model improves performance
further, but by less than 0.5% absolute. Similarly,
adding the Brown data to the WSJ model increased
performance on WSJ by less than 0.5%. Thus, even
a large amount of additional data seems to have relatively
little impact if it is not matched to the test
material.
The more varied nature of the Brown corpus also
seems to impact results, as all the results on Brown
are lower than the WSJ result.
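
The recall and precision figures in Table 2 are over labeled constituents. The paper does not spell out the scoring procedure, so the following is only a sketch of the usual labeled-bracket computation, under the assumption that each constituent is represented as a (label, start, end) triple; `labeled_precision_recall` is a hypothetical helper, not the evaluation tool actually used.

```python
from collections import Counter
from typing import Iterable, Tuple

Constituent = Tuple[str, int, int]  # (label, start index, end index)


def labeled_precision_recall(
    gold: Iterable[Constituent], predicted: Iterable[Constituent]
) -> Tuple[float, float]:
    """Return (precision, recall) over labeled constituents."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a predicted constituent is correct only if its
    # label and span both match a gold constituent.
    matched = sum((gold_counts & pred_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    return precision, recall


gold_brackets = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
pred_brackets = [("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5), ("PP", 3, 5)]
precision, recall = labeled_precision_recall(gold_brackets, pred_brackets)
```

In this toy case two of four predicted constituents are correct (precision 0.5) and two of three gold constituents are recovered (recall 2/3).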
5 The Effect of Lexical Dependencies
The parsers cited above all use some variety of lexical
dependency feature to capture statistics on the co-occurrence
of pairs of words being found in parent-child
relations within the parse tree. These word
pair relations, also called lexical bigrams (Collins,
1996), are reminiscent of dependency grammars such
as Mel'čuk (1988) and the link grammar of Sleator
and Temperley (1993). In Collins' Model 1, the word
pair statistics occur in the distribution

\[ P_{cw}(Chw \mid P, H, Hht, Hhw, \Delta, C, Cht) \]

where Hhw represents the head word of a parent node
in the tree and Chw the head word of its (non-head)
child. (The head word of a parent is the same as the
head word of its head child.) Because this is the only
part of the model that involves pairs of words, it is
also where the bulk of the parameters are found. The
large number of possible pairs of words in the vocabulary
makes the training data necessarily sparse. In
order to avoid assigning zero probability to unseen
events, it is necessary to smooth the training data.
The Collins model uses linear interpolation to estimate
probabilities from empirical distributions of
varying specificities:
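
\[
\begin{aligned}
P_{cw}(Chw \mid P, H, Hht, Hhw, \Delta, C, Cht) ={}
  & \lambda_1 \tilde{P}(Chw \mid P, H, Hht, Hhw, \Delta, C, Cht) \\
  & + (1 - \lambda_1) \Big( \lambda_2 \tilde{P}(Chw \mid P, H, Hht, \Delta, C, Cht)
    + (1 - \lambda_2) \tilde{P}(Chw \mid Cht) \Big)
\end{aligned}
\qquad (1)
\]
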
where $\tilde{P}$ represents the empirical distribution derived
directly from the counts in the training data.
The interpolation weights $\lambda_1$, $\lambda_2$ are chosen as a
function of the number of examples seen for the conditioning
events and the number of unique values
seen for the predicted variable. Only the first distribution
in this interpolation scheme involves pairs of
words, and the third component is simply the probability
of a word given its part of speech.
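
The interpolation scheme can be sketched in a few lines. The text says only that the weights depend on the number of examples seen for the conditioning events and the number of unique values seen for the predicted variable; the specific form n / (n + k·u) with k = 5 used below is an assumption for illustration, and both function names are hypothetical.

```python
def interpolation_weight(n_examples: int, n_unique: int, k: float = 5.0) -> float:
    """Weight on the more specific estimate.

    Assumed form (not given in the text): n / (n + k * u), where n is the number of
    training examples for the conditioning context and u is the number of distinct
    outcomes seen in that context. An unseen context gets weight 0.
    """
    if n_examples == 0:
        return 0.0
    return n_examples / (n_examples + k * n_unique)


def interpolate(p1: float, p2: float, p3: float, lam1: float, lam2: float) -> float:
    """Three-level linear interpolation, most specific estimate first.

    p1: empirical estimate conditioned on the head word (the lexical bigram level)
    p2: empirical estimate without the head word
    p3: empirical estimate of the word given its part-of-speech tag
    """
    return lam1 * p1 + (1.0 - lam1) * (lam2 * p2 + (1.0 - lam2) * p3)


lam = interpolation_weight(n_examples=10, n_unique=2)  # 10 / (10 + 5 * 2)
smoothed = interpolate(0.2, 0.1, 0.05, lam1=lam, lam2=0.5)
```

Setting `lam1` to zero leaves only the two less specific levels, which is how a model without lexical bigrams can be expressed in the same code.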
Because the word pair feature is the most specific
in the model, it is likely to be the most corpus-specific.
The vocabularies used in corpora vary, as
do the word frequencies. It is reasonable to expect
word co-occurrences to vary as well. In order
to test this hypothesis, we removed the distribution
$\tilde{P}(Chw \mid P, H, Hht, Hhw, \Delta, C, Cht)$ from the parsing
model entirely, relying on the interpolation of the
two less specific distributions in the parser:

\[
P_{cw2}(Chw \mid P, H, Hht, \Delta, C, Cht) =
  \lambda_2 \tilde{P}(Chw \mid P, H, Hht, \Delta, C, Cht)
  + (1 - \lambda_2) \tilde{P}(Chw \mid Cht)
\qquad (2)
\]

We performed cross-corpus experiments as before
to determine whether the simpler parsing model
might be more robust to corpus effects. Results are
shown in Table 3.
Perhaps the most striking result is just how little
the elimination of lexical bigrams affects the baseline
system: performance on the WSJ corpus decreases
by less than 0.5% absolute. Moreover, the performance
of a WSJ-trained system without lexical bigrams
on Brown test data is identical to that of the
WSJ-trained system with lexical bigrams. Lexical co-occurrence
statistics seem to be of no benefit when
attempting to generalize to a new corpus.
6 Pruning Parser Parameters
The relatively high performance of a parsing model
with no lexical bigram statistics on the WSJ task
led us to explore whether it might be possible to
significantly reduce the size of the parsing model
by selectively removing parameters without sacrificing
performance. Such a technique reduces the
parser's memory requirements as well as the over-
head of loading and storing the model, which could
be desirable for an application where limited com-
puting resources are available.
Significant effort has gone into developing techniques
for pruning statistical language models for
speech recognition, and we borrow from this work,
using the weighted difference technique of Seymore
and Rosenfeld (1996). This technique applies to any
statistical model which estimates probabilities by
backing off, that is, using probabilities from a less
specific distribution when no data are
available for the full distribution, as the following
equations show for the general case:

\[
P(e \mid h) =
\begin{cases}
  P_1(e \mid h) & \text{if } e \notin BO(h) \\
  \alpha(h) \, P_2(e \mid h') & \text{if } e \in BO(h)
\end{cases}
\]

Here $e$ is the event to be predicted, $h$ is the set of
conditioning events or history, $\alpha$ is a backoff weight,
and $h'$ is the subset of conditioning events used for
the less specific backoff distribution. $BO$ is the backoff
set of events for which no data are present in the
specific distribution $P_1$. In the case of n-gram language
modeling, $e$ is the next word to be predicted,
and the conditioning events are the $n - 1$ preceding
words. In our case the specific distribution $P_1$ of the
backoff model is $P_{cw}$ of equation 1, itself a linear
interpolation of three empirical distributions from the
training data. The less specific distribution $P_2$ of the
backoff model is $P_{cw2}$ of equation 2, an interpolation
of two empirical distributions. The backoff weight $\alpha$
is simply $1 - \lambda_1$ in our linear interpolation model.
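
The general backoff form can be written as a short lookup, shown here as a sketch under simplifying assumptions: probabilities are stored in plain dictionaries, a history is a tuple of conditioning events, and the reduced history h' is obtained by dropping the first element of the tuple. All names and toy numbers are hypothetical and serve only to show the two cases of the equation.

```python
from typing import Dict, Hashable, Tuple

History = Tuple[Hashable, ...]


def backoff_probability(
    event: Hashable,
    history: History,
    specific: Dict[Tuple[Hashable, History], float],
    alpha: Dict[History, float],
    general: Dict[Tuple[Hashable, History], float],
) -> float:
    """P(e | h): use the specific estimate if present, otherwise back off."""
    if (event, history) in specific:
        # e is not in the backoff set BO(h): data exist in the specific distribution.
        return specific[(event, history)]
    # e is in BO(h): scale the less specific estimate by the backoff weight alpha(h).
    reduced = history[1:]  # assumption: h' drops the first conditioning event
    return alpha[history] * general[(event, reduced)]


# Toy model (illustrative numbers only).
specific = {("York", ("New",)): 0.6}
alpha = {("New",): 0.4}
general = {("York", ()): 0.1, ("Jersey", ()): 0.2}

p_seen = backoff_probability("York", ("New",), specific, alpha, general)
p_backed_off = backoff_probability("Jersey", ("New",), specific, alpha, general)
```

Removing an entry from `specific` moves that event into the backoff set, which is exactly what pruning a parameter does.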
The Seymore/Rosenfeld pruning technique can be
used to prune backoff probability models regardless
of whether the backoff weights are derived from linear
interpolation weights or discounting techniques
such as Good-Turing. In order to ensure that the
model's probabilities still sum to one, the backoff
weight $\alpha$ must be adjusted whenever a parameter is
removed from the model. In the Seymore/Rosenfeld
approach, parameters are pruned according to the
following criterion:

\[ N(e, h) \left( \log p(e \mid h) - \log p'(e \mid h') \right) \qquad (3) \]

where $p'(e \mid h')$ represents the new backed off probability
estimate after removing $p(e \mid h)$ from the model
and adjusting the backoff weight, and $N(e, h)$ is the
count in the training data. This criterion aims to
prune probabilities that are similar to their backoff
estimates, and that are not frequently used. As
shown by Stolcke (1998), this criterion is an approximation
of the relative entropy between the original
and pruned distributions, but does not take into account
the effect of changing the backoff weight on
other events' probabilities.

                               w/ bigrams        w/o bigrams
Training Data   Test Set     Recall   Prec.    Recall   Prec.
WSJ             WSJ           86.1    86.6      85.6    86.2
WSJ             Brown         80.3    81.0      80.3    81.0
Brown           Brown         83.6    84.6      83.5    84.4
WSJ+Brown       Brown         83.9    84.8      83.4    84.3
WSJ+Brown       WSJ           86.3    86.9      85.7    86.4

Table 3: Parsing results by training and test corpus,
with and without lexical bigrams

Adjusting the threshold $\theta$ below which parameters
are pruned allows us to successively remove more
and more parameters. Results for different values of
$\theta$ are shown in Table 4.
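
A minimal sketch of threshold pruning with the weighted difference criterion follows. It makes two simplifying assumptions that should be kept in mind: the backed-off estimate for each parameter is supplied by the caller (already reflecting any adjustment of the backoff weight), and the sketch does not itself renormalize the model after removal. The function names and toy numbers are hypothetical.

```python
import math
from typing import Dict, Hashable, Tuple

# Per-parameter statistics: (training count N(e,h), p(e|h), backed-off p'(e|h')).
Stats = Tuple[int, float, float]


def pruning_score(count: int, p_specific: float, p_backed_off: float) -> float:
    """Weighted difference: N(e,h) * (log p(e|h) - log p'(e|h'))."""
    return count * (math.log(p_specific) - math.log(p_backed_off))


def prune(params: Dict[Hashable, Stats], theta: float) -> Dict[Hashable, Stats]:
    """Keep only the parameters whose score is at least the threshold theta."""
    return {
        key: stats
        for key, stats in params.items()
        if pruning_score(*stats) >= theta
    }


# Toy parameters (illustrative numbers only).
toy_params = {
    ("York", "New"): (100, 0.5, 0.01),   # frequent, far from its backoff estimate
    ("time", "at"): (3, 0.02, 0.019),    # rare, close to its backoff estimate
}
kept = prune(toy_params, theta=1.0)
```

A parameter survives only if it is both frequent and far from its backoff estimate; raising `theta` removes progressively more of the model.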
The complete parsing model derived from the
WSJ training set has 735,850 parameters in a total
of nine distributions: three levels of backoff for
each of the three distributions $P_h$, $P_c$ and $P_{cw}$. The
lexical bigrams are contained in the most specific
distribution for $P_{cw}$. Removing all these parameters
reduces the total model size by 43%. The results
show a gradual degradation as more parameters are
pruned.
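
The size reductions reported in Table 4 follow directly from the parameter counts. The short check below recomputes them from the total of 735,850 parameters and the removed counts as printed in the table (which are rounded to the nearest thousand, so the percentages are approximate); the variable names are hypothetical.

```python
TOTAL_PARAMETERS = 735_850

# Parameters removed at each pruning level, as reported (rounded) in Table 4.
removed_by_level = {
    "threshold 1": 96_000,
    "threshold 2": 166_000,
    "threshold 3": 213_000,
    "all lexical bigrams": 316_000,
}

# Percentage reduction in model size, rounded to the nearest whole percent.
reduction_percent = {
    level: round(100.0 * removed / TOTAL_PARAMETERS)
    for level, removed in removed_by_level.items()
}
```

Removing all lexical bigram parameters gives 316,000 / 735,850, or roughly 43%, which is consistent with the statement that over a third of the model's parameters can be eliminated.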
The ten lexical bigrams with the highest scores for
the pruning metric are shown in Table 5 for WSJ
and Table 6 for Brown. The pruning metric of equation 3 has
been normalized by corpus size to allow comparison
between WSJ and Brown. The only overlap
between the two sets is for pairs of unknown word
tokens. The WSJ bigrams are almost all specific
to finance, are all word pairs that are likely to appear
immediately adjacent to one another, and are
all children of the base NP syntactic category. The
Brown bigrams, which have lower correlation values
by our metric, include verb/subject and preposition/object
relations and seem more broadly applicable
as a model of English. However, the pairs
are not strongly related semantically, no doubt because
the first term of the pruning criterion favors
the most frequent words, such as forms of the verbs
"be" and "have".

Child word   Head word   Parent   Pruning
Chw          Hhw         P        Metric
New          York        NPB      .0778
Stock        Exchange    NPB      .0336
<unk>        <unk>       NPB      .0313
vice         president   NPB      .0312
Wall         Street      NPB      .0291
San          Francisco   NPB      .0291
York         Stock       NPB      .0243
Mr.          <unk>       NPB      .0241
third        quarter     NPB      .0227
Dow          Jones       NPB      .0227

Table 5: Ten most significant lexical bigrams from
WSJ, with parent category (other syntactic context
variables not shown) and pruning metric.
NPB is Collins' "base NP" category.

Child word   Head word   Parent   Pruning
Chw          Hhw         P        Metric
It           was         S        .0174
it           was         S        .0169
<unk>        of          PP       .0156
<unk>        in          PP       .0097
course       Of          PP       .0090
been         had         VP       .0088
<unk>        <unk>       NPB      .0079
they         were        S        .0077
I            'm          S        .0073
time         at          PP       .0073

Table 6: Ten most significant lexical bigrams from
Brown

7 Conclusion
Our results show strong corpus effects for statistical
parsing models: a small amount of matched training
data appears to be more useful than a large
amount of unmatched data. The standard WSJ
task seems to be simplified by its homogeneous style.
Adding training data from an unmatched corpus
doesn't hurt, but doesn't help a great deal either.
In particular, lexical bigram statistics appear to
be corpus-specific, and our results show that they
are of no use when attempting to generalize to new
training data. In fact, they are of surprisingly little
benefit even for matched training and test data:
removing them from the model entirely reduces performance
by less than 0.5% on the standard WSJ
parsing task. Our selective pruning technique allows
for a more fine-grained tuning of parser model
size, and would be particularly applicable to cases
where large amounts of training data are available
but memory usage is a consideration. In our implementation,
pruning allowed models to run within
256MB that, unpruned, required larger machines.

Threshold         # parameters   % reduction
θ                 removed        in model size   Recall   Prec.
0 (full model)        0               0           86.1    86.6
1                    96K             13           86.0    86.4
2                   166K             23           85.9    86.2
3                   213K             29           85.7    86.2
∞                   316K             43           85.6    86.2

Table 4: Parsing results with pruned probability models. The complete parsing model contains 736K
parameters in nine distributions. Removing all lexical bigram parameters reduces the size of the model by
43%.

The parsing models of Charniak (2000) and
Collins (2000) add more complex features to the
parsing model that we use as our baseline. An
area for future work is investigation of the degree
to which such features apply across corpora, or, on
the other hand, further tune the parser to the peculiarities
of the Wall Street Journal. Of particular
interest are the automatic clusterings of lexical
co-occurrences used in Charniak (1997) and Magerman
(1995). Cross-corpus experiments could reveal
whether these clusters uncover generally applicable
semantic categories for the parser's use.
Acknowledgments This work was undertaken as
part of the FrameNet project at ICSI, with funding
from National Science Foundation grant ITR/HCI
#0086132.

References
Douglas Biber. 1993. Using register-diversified corpora for general language studies. Computational Linguistics, 19(2):219–241, June.
Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In AAAI97, Brown University, Providence, Rhode Island, August.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), Seattle, Washington.
Michael Collins, Jan Hajic, Lance Ramshaw, and Christoph Tillmann. 1999. A statistical parser for Czech. In Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland.
Michael Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the ACL.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL.
Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.
Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proceedings of the ICML.
Rebecca Hwa. 1999. Supervised grammar induction using training data with limited constituent information. In Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland.
David Magerman. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the ACL.
Igor A. Mel'čuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.
Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.
Douglas Roland and Daniel Jurafsky. 1998. How verb subcategorization frequencies are affected by corpus choice. In Proceedings of COLING/ACL, pages 1122–1128.
Douglas Roland, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder, and Chris Riddoch. 2000. Verb subcategorization frequency differences between business-news and balanced corpora: the role of verb sense. In Proceedings of the Association for Computational Linguistics (ACL-2000) Workshop on Comparing Corpora.
Kristie Seymore and Roni Rosenfeld. 1996. Scalable backoff language models. In ICSLP-96, volume 1, pages 232–235, Philadelphia.
Daniel Sleator and Davy Temperley. 1993. Parsing English with a link grammar. In Third International Workshop on Parsing Technologies, August.
A. Stolcke, C. Chelba, D. Engle, V. Jimenez, L. Mangu, H. Printz, E. Ristad, R. Rosenfeld, D. Wu, F. Jelinek, and S. Khudanpur. 1996. Dependency language modeling. Summer Workshop Final Report 24, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, April.
Andreas Stolcke. 1998. Entropy-based pruning of backoff language models. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, pages 270–274, Lansdowne, Va.
