APPORTIONING DEVELOPMENT EFFORT 
IN A PROBABILISTIC LR PARSING SYSTEM 
THROUGH EVALUATION 
John Carroll
Cognitive and Computing Sciences
University of Sussex
Brighton BN1 9QH, UK
john.carroll@cogs.susx.ac.uk
Ted Briscoe
Computer Laboratory
University of Cambridge
Pembroke Street, Cambridge CB2 3QG, UK
ejb@cl.cam.ac.uk
Abstract 
We describe an implemented system for robust 
domain-independent syntactic parsing of English, 
using a unification-based grammar of part-of- 
speech and punctuation labels coupled with a 
probabilistic LR parser. We present evaluations 
of the system's performance along several differ- 
ent dimensions; these enable us to assess the con- 
tribution that each individual part is making to 
the success of the system as a whole, and thus 
prioritise the effort to be devoted to its further 
enhancement. Currently, the system is able to 
parse around 80% of sentences in a substantial 
corpus of general text containing a number of 
distinct genres. On a random sample of 250 
such sentences the system has a mean crossing 
bracket rate of 0.71 and recall and precision of 
83% and 84% respectively when evaluated against
manually-disambiguated analyses¹.
1. INTRODUCTION 
This work is part of an effort to develop a ro- 
bust, domain-independent syntactic parser capa- 
ble of yielding the unique correct analysis for un- 
restricted naturally-occurring input. Our goal is 
to develop a system with performance compara- 
ble to extant part-of-speech taggers, returning a 
syntactic analysis from which predicate-argument 
structure can be recovered, and which can support
semantic interpretation. The requirement for
a domain-independent analyser favours statistical
techniques to resolve ambiguities, whilst the latter
goal favours a more sophisticated grammatical
formalism than is typical in statistical approaches
to robust analysis of corpus material.
¹ Some of this work was carried out while the
second author was visiting Rank Xerox, Grenoble.
The work was also supported by UK DTI/SALT
project 41/5808 'Integrated Language Database', and
by SERC/EPSRC Advanced Fellowships to both authors.
Geoff Nunberg provided encouragement and
much advice on the analysis of punctuation, and Greg
Grefenstette undertook the original corpus tokenisation
and segmentation for the punctuation experiments.
Bernie Jones and Kiku Ribas made helpful
comments on an earlier draft. We are responsible for
any mistakes.
Briscoe & Carroll (1993) describe a probabilistic
parser using a wide-coverage unification-based
grammar of English written in the Alvey
Natural Language Tools (ANLT) metagrammatical
formalism (Briscoe et al., 1987), generating
around 800 rules in a syntactic variant of the
Definite Clause Grammar formalism (DCG, Pereira
& Warren, 1980) extended with iterative (Kleene)
operators. The ANLT grammar is linked to a lexicon
containing about 64K entries for 40K lexemes,
including detailed subcategorisation information
appropriate for the grammar, built semi-automatically
from a learners' dictionary (Carroll
& Grover, 1989). The resulting parser is
efficient, constructing a parse forest in roughly 
quadratic time (empirically), and efficiently re- 
turning the ranked n-most likely analyses (Car- 
roll, 1993, 1994). The probabilistic model is a 
refinement of probabilistic context-free grammar 
(PCFG) conditioning CF 'backbone' rule applica- 
tion on LR state and lookahead item. Unification 
of the 'residue' of features not incorporated into 
the backbone is performed at parse time in con- 
junction with reduce operations. Unification fail- 
ure results in the associated derivation being as- 
signed a probability of zero. Probabilities are as- 
signed to transitions in the LALR(1) action table 
via a process of supervised training based on com- 
puting the frequency with which transitions are 
traversed in a corpus of parse histories. The result 
is a probabilistic parser which, unlike a PCFG, is 
capable of probabilistically discriminating deriva- 
tions which differ only in terms of order of appli- 
cation of the same set of CF backbone rules, due 
to the parse context defined by the LR table. 
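As a rough illustration of the training step described above, the following Python sketch estimates action probabilities by relative frequency over a corpus of parse histories and scores a derivation as the product of its transition probabilities. The representation of a history as (state, lookahead, action) triples, the function names, and the choice to normalise over the actions sharing a state and lookahead are assumptions made for illustration only; the normalisation actually used is described in Briscoe & Carroll (1993) and may differ. Unification failure and smoothing are not modelled here.

```python
from collections import Counter, defaultdict


def estimate_action_probs(parse_histories):
    """Relative-frequency estimate of P(action | LR state, lookahead).

    parse_histories: iterable of histories, each a list of
    (state, lookahead, action) triples (a hypothetical encoding).
    """
    counts = defaultdict(Counter)
    for history in parse_histories:
        for state, lookahead, action in history:
            counts[(state, lookahead)][action] += 1

    probs = {}
    for context, actions in counts.items():
        total = sum(actions.values())
        probs[context] = {a: c / total for a, c in actions.items()}
    return probs


def derivation_prob(history, probs):
    """Product of transition probabilities along one parse history.

    Transitions never seen in training get probability 0.0 here.
    """
    p = 1.0
    for state, lookahead, action in history:
        p *= probs.get((state, lookahead), {}).get(action, 0.0)
    return p
```

Because each probability is indexed by the LR state in which an action is taken, two derivations that apply the same backbone rules in a different order pass through different table entries and can therefore receive different scores, which is the property the paragraph above contrasts with a PCFG.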
Experiments with this system revealed three 
major problems which our current research is ad- 
dressing. Firstly, improvements in probabilistic 
parse selection will require a 'lexicalised' gram- 
mar/parser in which (minimally) probabilities are 
associated with alternative subcategorisation pos- 
sibilities of individual lexical items. Currently, the 
relative frequency of subcategorisation possibili- 
ties for individual lexical items is not recorded in 
wide-coverage lexicons, such as ANLT or COMLEX
(Grishman et al., 1994). Secondly, removal
of punctuation from the input (after segmen- 
tation into text sentences) worsens performance 
as punctuation both reduces syntactic ambigu- 
ity (Jones, 1994) and signals non-syntactic (dis- 
course) relations between text units (Nunberg, 
1990). Thirdly, the largest source of error on un- 
seen input is the omission of appropriate subcate- 
gorisation values for lexical items (mostly verbs), 
preventing the system from finding the correct 
analysis. The current coverage--the proportion
of sentences for which at least one analysis was
found²--of this system on a general corpus (e.g.
Brown or LOB) is estimated to be around 20%
by Briscoe (1994). Therefore, we have developed
a variant probabilistic LR parser which does not
rely on subcategorisation and uses punctuation to
reduce ambiguity. The analyses produced by this
parser can be utilised for phrase-finding applica- 
tions, recovery of subcategorisation frames, and 
other 'intermediate' level parsing problems. 
2. PART-OF-SPEECH TAG
SEQUENCE GRAMMAR 
We utilised the ANLT metagrammatical formal- 
ism to develop a feature-based, declarative de- 
scription of part-of-speech (PoS) label sequences 
(see e.g. Church, 1988) for English. This gram- 
mar compiles into a DCG-like grammar of ap- 
proximately 400 rules. It has been designed 
to enumerate possible valencies for predicates 
(verbs, adjectives and nouns) by including sep- 
arate rules for each pattern of possible comple- 
mentation in English. The distinction between ar- 
guments and adjuncts is expressed, following X- 
bar theory (e.g. Jackendoff, 1977), by Chomsky- 
adjunction of adjuncts to maximal projections 
(XP → XP Adjunct) as opposed to government of
arguments (i.e. arguments are sisters within X1
projections; X1 → X0 Arg1 ... ArgN). Although
the grammar enumerates complementation pos- 
sibilities and checks for global sentential well- 
formedness, it is best described as 'intermediate' 
as it does not attempt to associate 'displaced' con- 
stituents with their canonical position / grammat- 
ical role. 
² Briscoe & Carroll (1995) note that "coverage" is
a weak measure since discovery of one or more global
analyses does not entail that the correct analysis is
recovered.
The other difference between this grammar
and a more conventional one is that it incorporates
some rules specifically designed to overcome
limitations or idiosyncrasies of the tagging process.
For example, past participles functioning
adjectivally, as in (1a), are frequently tagged as past
participles (VVN) as in (1b), so the grammar
incorporates a rule (violating X-bar theory) which
parses past participles as adjectival premodifiers
in this context.
(1) a The disembodied head 
b The_AT disembodied_VVN head_NN1 
Similar idiosyncratic rules are incorporated for 
dealing with gerunds, adjective-noun conversions, 
idiom sequences, and so forth. Further details of 
the PoS grammar are given in Briscoe & Carroll 
(1994, 1995). 
The grammar currently covers around 80% of 
the Susanne corpus (Sampson, 1995), a 138K word 
treebanked and balanced subset of the Brown cor- 
pus. Many of the 'failures' are due to the root 
S(entence) requirement enforced by the parser 
when dealing with fragments from dialogue and 
so forth. We have not relaxed this requirement 
since it increases ambiguity, our primary interest 
at this point being the extraction of subcategorisa- 
tion information from full clauses in corpus data. 
3. TEXT GRAMMAR AND 
PUNCTUATION 
Nunberg (1990) develops a partial 'text' grammar 
for English which incorporates many constraints
that (ultimately) restrict syntactic and seman- 
tic interpretation. For example, textual adjunct 
clauses introduced by colons scope over following 
punctuation, as (2a) illustrates; whilst textual ad- 
juncts introduced by dashes cannot intervene be- 
tween a bracketed adjunct and the textual unit to 
which it attaches, as in (2b). 
(2) a *He told them his reason: he would not 
renegotiate his contract, but he did not 
explain to the team owners. (vs. but 
would stay) 
b *She left - who could blame her - (dur- 
ing the chainsaw scene) and went home. 
We have developed a declarative grammar in 
the ANLT metagrammatical formalism, based on 
Nunberg's procedural description. This grammar
captures the bulk of the text-sentential constraints
described by Nunberg and compiles into
26 DCG-like rules. Text grammar analyses
are useful because they demarcate some of
the syntactic boundaries in the text sentence and
thus reduce ambiguity, and because they identify
the units for which a syntactic analysis should, in
principle, be found; for example, in (3), the absence
of dashes would mislead a parser into seeking
a syntactic relationship between three and the
following names, whilst in fact there is only a
discourse relation of elaboration between this text
adjunct and pronominal three.
(3) The three - Miles J. Cooperman, Sheldon 
Teller, and Richard Austin - and eight 
other defendants were charged in six in- 
dictments with conspiracy to violate fed- 
eral narcotic law. 
Further details of the text grammar are given 
in Briscoe & Carroll (1994, 1995). The text
grammar has been tested on the Susanne corpus 
and covers 99.8% of sentences. (The failures are 
mostly text segmentation problems). The number 
of analyses varies from one (71%) to the thousands 
(0.1%). Just over 50% of Susanne sentences con- 
tain some punctuation, so around 20% of the sin- 
gleton parses are punctuated. The major source of 
ambiguity in the analysis of punctuation concerns 
the function of commas and their relative scope as 
a result of a decision to distinguish delimiters and 
separators (Nunberg 1990:36). Therefore, a text 
sentence containing eight commas (and no other 
punctuation) will have 3170 analyses. The mul- 
tiple uses of commas cannot be resolved without 
access to (at least) the syntactic context of occur- 
rence. 
4. THE INTEGRATED 
GRAMMAR 
Despite Nunberg's observation that text grammar 
is distinct from syntax, text grammatical ambigu- 
ity favours interleaved application of text gram- 
matical and syntactic constraints. Integrating the 
text and the PoS sequence grammars is straight- 
forward and the result remains modular, in that 
the text grammar is 'folded into' the PoS sequence 
grammar, by treating text and syntactic categories 
as overlapping and dealing with the properties of 
each using disjoint sets of features, principles of 
feature propagation, and so forth. In addition to 
the core text-grammatical rules which carry over 
unchanged from the stand-alone text grammar, 44 
syntactic rules (of pre- and post- posing, and co- 
ordination) now include (often optional) comma 
markers corresponding to the purely 'syntactic' 
uses of punctuation. 
The approach to text grammar taken here is in 
many ways similar to that of Jones (1994). How- 
ever, he opts to treat punctuation marks as clitics 
on words which introduce additional featural in- 
formation into standard syntactic rules. Thus, his 
grammar is thoroughly integrated and it would be 
harder to extract an independent text grammar 
or build a modular semantics. Our less-tightly in- 
tegrated grammar is described in more detail in 
Briscoe & Carroll (1994). 
5. PARSING THE SUSANNE AND 
SEC CORPORA 
We have used the integrated grammar to parse 
the Susanne corpus and the quite distinct Spoken 
English Corpus (SEC; Taylor & Knowles, 1988), a
50K word treebanked corpus of transcribed British 
radio programmes punctuated by the corpus com- 
pilers. Both corpora were retagged using the Ac- 
quilex HMM tagger (Elworthy, 1993, 1994) trained 
on text tagged with a slightly modified version of 
CLAWS-II labels (Garside et al., 1987). In con- 
trast to previous systems taking as input fully- 
determinate sequences of PoS labels, such as Fid- 
ditch (Hindle, 1989) and MITFP (de Marcken, 
1990), for each word the tagger returns multiple 
label hypotheses, and each is thresholded before 
being passed on to the parser: a given label is re- 
tained if it is the highest-ranked, or, if the highest- 
ranked label is assigned a likelihood of less than 
0.9, if its likelihood is within a factor of 50 of this. 
We thus attempt to minimise the effect of incor- 
rect tagging on the parsing component by allow- 
ing label ambiguities, but control the increase in 
indeterminacy and concomitant decrease in subse- 
quent processing efficiency by applying the thresh- 
olding technique. On Susanne, retagging allowing 
only a single label per word results in a 97.90% 
label/word assignment accuracy, whereas multi- 
label tagging with this thresholding scheme results 
in 99.51% accuracy. 
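The thresholding rule described above can be sketched in a few lines of Python. The function name, the dictionary representation of the tagger output, and the treatment of the boundary case (a label exactly a factor of 50 below the best one is kept) are assumptions for illustration; only the 0.9 cutoff and the factor of 50 come from the text.

```python
def threshold_labels(label_probs, top_cutoff=0.9, factor=50.0):
    """Select the PoS labels for one word to pass on to the parser.

    label_probs: dict mapping label -> likelihood from the tagger
    (hypothetical format). The highest-ranked label is always kept.
    If its likelihood is below top_cutoff, any other label whose
    likelihood is within `factor` of the best one is kept as well.
    """
    ranked = sorted(label_probs.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    kept = [best_label]
    if best_p < top_cutoff:
        kept += [lab for lab, p in ranked[1:] if p * factor >= best_p]
    return kept
```

For example, a word tagged with likelihoods 0.95 and 0.05 keeps only its top label, whereas with likelihoods 0.6, 0.39 and 0.01 the first two labels are kept and the third is dropped, since 0.01 is more than a factor of 50 below 0.6.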
In an earlier paper (Briscoe & Carroll, 1995) 
we gave results for a previous version of the gram- 
mar and parsing system. We have made a num- 
ber of significant improvements to the system since 
then, the most fundamental being the use of multi- 
ple labels for each word. System accuracy evalua- 
tion results are also improved since we now output 
trees that conform more closely to the annotation 
conventions employed in the test treebank. 
COVERAGE AND AMBIGUITY 
To examine the efficiency and coverage of the 
grammar we applied it to our retagged versions of 
Susanne and SEC. We used the ANLT chart parser 
(Carroll, 1993), but modified just to count the 
number of possible parses in the parse forests (Billot
& Lang, 1989) rather than actually unpacking
them. We also imposed a per-sentence time-out 
of 30 seconds CPU time, running in Franz Alle- 
gro Common Lisp 4.2 on an HP PA-RISC 715/100 
workstation with 128 Mbytes of physical memory. 
                               Susanne           SEC
Parse fails                  1476  21.0%      809  31.3%
1-9 parses                   1436  20.5%      477  18.4%
10-99 parses                 1218  17.4%      378  14.6%
100-999 parses                953  13.6%      276  10.7%
1K-9.9K parses                694   9.9%      225   8.7%
10K-99K parses                474   6.8%      154   6.0%
100K+ parses                  750  10.7%      264  10.2%
Time-outs                      13   0.2%        4   0.2%
Number of sentences          7014            2717
Mean sentence length (MSL)   20.1            22.6
MSL - fails                  20.9            29.5
MSL - time-outs              73.6            65.8
Average Parse Base          1.313           1.300
Table 1: Grammar coverage on Susanne and SEC
For both corpora, the majority of sentences
analysed successfully received under 100 parses, 
although there is a long tail in the distribu- 
tion. Monitoring this distribution is helpful during 
grammar development to ensure that coverage is 
increasing but the ambiguity rate is not. A more 
succinct though less intuitive measure of ambigu- 
ity rate for a given corpus is Briscoe & Carroll's 
(1995) average parse base (APB), defined as the 
geometric mean over all sentences in the corpus 
of $\sqrt[n]{p}$, where n is the number of words in a
sentence, and p the number of parses for that sentence.
Thus, given a sentence n words long, the
APB raised to the nth power gives the number of 
analyses that the grammar can be expected to as- 
sign to a sentence of that length in the corpus. Ta- 
ble 1 gives these measures for all of the sentences 
in Susanne and in SEC. 
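The APB definition above translates directly into code. The sketch below is illustrative only: the function names and the input format are assumptions, and it assumes that only sentences with at least one parse are included, since the logarithm of zero parses is undefined.

```python
import math


def average_parse_base(sentences):
    """Geometric mean over sentences of p ** (1 / n).

    sentences: iterable of (n_words, n_parses) pairs with
    n_words >= 1 and n_parses >= 1 (assumed input format).
    """
    # Work in log space: log(p ** (1/n)) == log(p) / n.
    logs = [math.log(p) / n for n, p in sentences]
    return math.exp(sum(logs) / len(logs))


def expected_analyses(apb, n_words):
    """Number of analyses expected for a sentence of n_words words."""
    return apb ** n_words
```

With the figures in Table 1, `expected_analyses(1.313, 20.1)` evaluates to roughly 238 and `expected_analyses(1.300, 22.6)` to roughly 376, the values for sentences of mean length in Susanne and SEC respectively.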
Given that the grammar was developed solely with
reference to Susanne, coverage of SEC is quite robust.
The two corpora differ considerably since the for- 
mer is drawn from American written text whilst 
the latter represents British transcribed spoken 
material. The corpora overall contain material 
drawn from widely disparate genres / registers, 
and are more complex than those used in DARPA 
ATIS tests, and more diverse than those used 
in MUCs and probably also the Penn Treebank. 
Black et al. (1993) report a coverage of around 
95% on computer manuals, as opposed to our cov- 
erage rate of 70-80% on much more heterogeneous 
data and longer sentences. The APBs for Susanne 
and SEC of 1.313 and 1.300 respectively indicate 
that sentences of average length in each corpus 
could be expected to be assigned of the order of 
238 and 376 analyses (i.e. $1.313^{20.1}$ and $1.300^{22.6}$).
The parser throughput on these tests, for sen- 
tences successfully analysed, is around 25 words 
per CPU second on an HP PA-RISC 715/100. 
Sentences of up to 30 tokens (words plus sentence- 
internal punctuation) are parsed in an average of 
under 1 second each, whilst those around 60 tokens 
take on average around 7 seconds. Nevertheless, 
the relationship between sentence length and pro- 
cessing time is fitted well by a quadratic function, 
supporting the findings of Carroll (1994) that in 
practice NL grammars do not evince worst-case 
parsing complexity. 
GRAMMAR DEVELOPMENT AND REFINEMENT
The results we report above relate to the latest 
version of the tag sequence grammar. To date, we 
have spent about one person-year on grammar de- 
velopment, with the effort spread fairly evenly over 
a two-and-a-half-year period. The various phases 
in the development and refinement of the grammar 
can be observed in an analysis of the coverage and 
APB for Susanne and SEC over this period--see 
table 2. The phases, with dates, were: 
6/92-11/93 Initial development of the grammar. 
11/93-7/94 Substantial increase in coverage on 
the development corpus (Susanne), correspond- 
ing to a drive to increase the general coverage 
of the grammar by analysing parse failures on 
actual corpus material. From a lower initial figure,
coverage of SEC (the unseen corpus) increased
by a larger factor.
7/94-12/94 Incremental improvements in cover- 
age, but at the cost of increasing the ambiguity 
of the grammar. 
12/94-10/95 Improving the accuracy of the sys- 
tem by trying to ensure that the correct analysis 
was in the set returned. 
Since the coverage on SEC is increasing at the 
same time as on Susanne, we can conclude that 
the grammar has not been specifically tuned to 
the particular sublanguages or genres represented 
in the development corpus. Also, although the
almost-50% initial coverage on the heterogeneous
text of Susanne compares well with the
state-of-the-art in grammar-based approaches to NL
analysis (e.g. see Taylor et al., 1989; Alshawi et al.,
1992), it is clear that the subsequent grammar
refinement phases have led to major improvements
in coverage and reductions in spurious ambiguity.
         Susanne                  SEC
date     coverage   APB^20.1      coverage
11/93    47.8%      667           34.3%
1/94     56.7%      160           45.7%
7/94     75.3%      192           67.1%
12/94    79.0%      217           68.9%
10/95    79.0%      238           68.7%
Table 2: Grammar coverage and ambiguity during
development
We have experimented with increasing the 
richness of the lexical feature set by incorporating 
subcategorisation information for verbs into the 
grammar and lexicon. We constructed randomly 
from Susanne a test corpus of 250 in-coverage sen- 
tences, and in this, for each word tagged as pos- 
sibly being an open-class verb (i.e. not a modal 
or auxiliary) we extracted from the ANLT lexi- 
con (Carroll & Grover, 1989) all verbal entries for 
that word. We then mapped these entries into 
our PoS grammar experimental subcategorisation 
scheme, in which we distinguished each possible 
pattern of complementation allowed by the gram- 
mar (but not control relationships, specification 
of prepositional heads of PP complements etc. as 
in the full ANLT representation scheme). We 
then attempted to parse the test sentences, us- 
ing the derived verbal entries instead of the orig- 
inal generic entries which generalised over all the 
subcategorisation possibilities. 31 sentences now 
failed to receive a parse, a decrease in coverage of 
12%. This is due to the fact that the ANLT lexi- 
con, although large and comprehensive by current 
standards (Briscoe & Carroll, 1996), nevertheless 
contains many errors of omission. 
PARSE SELECTION 
A probabilistic LR parser was trained with the in- 
tegrated grammar by exploiting the Susanne tree- 
bank bracketing. An LR parser (Briscoe & Car- 
roll, 1993) was applied to unlabelled bracketed 
sentences from the Susanne treebank, and a new 
treebank of 1758 correct and complete analyses 
with respect to the integrated grammar was con- 
structed semi-automatically by manually resolving 
the remaining ambiguities. 250 sentences from the 
new treebank, selected randomly, were kept back 
for testing³. The remainder, together with a further
set of analyses from 2285 treebank sentences
that were not checked manually, were used to 
train a probabilistic version of the LR parser, us- 
ing Good-Turing smoothing to estimate the prob- 
ability of unseen transitions in the LALR(1) ta- 
ble (Briscoe & Carroll, 1993; Carroll, 1993). The 
probabilistic parser can then return a ranking of 
all possible analyses for a sentence, or efficiently 
return just the n-most probable (Carroll, 1993). 
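To give a flavour of how unseen transitions can receive non-zero probability, the sketch below uses the basic Good-Turing estimate of the total probability mass of unseen events, N1/N (the number of events seen exactly once divided by the total number of observations). It is a deliberately simplified variant and not necessarily the procedure used in the system: seen events are merely rescaled proportionally rather than given Good-Turing adjusted counts, the reserved mass is shared equally among unseen events, and all names are hypothetical.

```python
def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen events.

    counts: dict mapping event -> observed frequency (all > 0).
    Returns N1 / N, where N1 is the number of events seen once.
    """
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


def smoothed_probs(counts, all_events):
    """Simplified smoothing: reserve N1/N for unseen events.

    Seen events share the remaining mass in proportion to their
    counts; unseen events share the reserved mass equally (an
    assumption made for this sketch). If every seen event is a
    singleton, all mass goes to the unseen events, which a fuller
    implementation would need to handle differently.
    """
    total = sum(counts.values())
    unseen = [e for e in all_events if counts.get(e, 0) == 0]
    p0 = good_turing_unseen_mass(counts) if unseen else 0.0
    probs = {e: (1.0 - p0) * c / total for e, c in counts.items() if c > 0}
    for e in unseen:
        probs[e] = p0 / len(unseen)
    return probs
```

In the setting described above, the events would be the transitions available from a given point in the LALR(1) table, and the counts would come from the training parse histories.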
The probabilistic parser was tested on the 
250 sentences held out from the manually- 
disambiguated treebank (of lengths 3-56 tokens, 
mean 18.2). The parser was set up to return 
only the highest-ranked analysis for each sentence. 
Table 3 shows the results of this test--with re- 
spect to the original Susanne bracketings--using 
the Grammar Evaluation Interest Group scheme 
(GEIG, see e.g. Harrison et al., 1991)⁴. This compares
unlabelled bracketings derived from corpus
treebanks with those derived from parses for the 
same sentences by computing recall, the ratio of 
matched brackets over all brackets in the treebank; 
precision, the ratio of matched brackets over all 
brackets found by the parser; mean crossings, the 
number of times a bracketed sequence output by 
the parser overlaps with one from the treebank 
but neither is properly contained in the other, av- 
eraged over all sentences; and zero crossings, the 
percentage of sentences for which the analysis re- 
turned has zero crossings. 
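The four measures just defined can be sketched as follows. This is not the GEIG evaluation software itself: brackets are assumed to be given as half-open (start, end) token spans, both bracket sets are assumed non-empty, and a crossing is counted once for each parser bracket that crosses at least one treebank bracket, which is one plausible reading of the definition above. All function names are illustrative.

```python
def crosses(a, b):
    """True if spans a and b overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)


def geig_scores(parser_brackets, treebank_brackets):
    """Return (recall, precision, crossings) for one sentence."""
    parser, gold = set(parser_brackets), set(treebank_brackets)
    matched = len(parser & gold)
    recall = matched / len(gold)        # matched / brackets in treebank
    precision = matched / len(parser)   # matched / brackets from parser
    crossings = sum(1 for b in parser if any(crosses(b, g) for g in gold))
    return recall, precision, crossings


def crossing_summary(sentence_pairs):
    """Mean crossings and percentage of zero-crossing sentences.

    sentence_pairs: list of (parser_brackets, treebank_brackets).
    """
    per_sentence = [geig_scores(p, g)[2] for p, g in sentence_pairs]
    n = len(per_sentence)
    mean_crossings = sum(per_sentence) / n
    zero_crossings = 100.0 * sum(1 for c in per_sentence if c == 0) / n
    return mean_crossings, zero_crossings
```

Because the brackets are unlabelled, a parser analysis can score perfectly on these measures while still assigning the wrong category to a constituent.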
The table also gives an indication of the best 
and worst possible performance of the disambigua- 
tion component of the system, showing the results 
obtained when parse selection is replaced by a sim- 
ple random choice, and the results of evaluating 
the analyses in the manually-disambiguated tree- 
bank against the corresponding original Susanne 
bracketings. In this latter figure, the mean number 
of crossings (0.41) is greater than zero mainly be- 
cause of incompatibilities between the structural 
representations chosen by the grammarian and the 
corresponding ones in the treebank. Precision is 
less than 100% due to crossings, minor mismatches 
and inconsistencies (due to the manual nature of 
the markup process) in tree annotations, and the 
fact that Susanne often favours a "flat" treatment 
of VP constituents, whereas our grammar always 
makes an explicit choice between argument- and 
adjunct-hood. Thus, perhaps a more informa- 
tive test of the accuracy of our probabilistic sys- 
tem would be evaluation against the manually- 
disambiguated corpus of analyses assigned by the 
grammar. In this, the mean crossing figure drops
to 0.71 and the recall and precision rise to 83-84%,
as shown in table 4.
³ The appendix contains a random sample of
sentences from the test corpus.
⁴ We would like to thank Phil Harrison for supplying
the evaluation software.
                                   Zero       Mean
                                   crossings  crossings  Recall  Precision
Probabilistic parser analyses
  Top-ranked analysis              59.6%      1.03       74.0%   73.0%
  Random analysis                  40.4%      1.84       58.6%   60.0%
Manually-disambiguated analyses
  'Ideal' analysis                 80.1%      0.41       85.4%   82.9%
Table 3: GEIG evaluation metrics for test set of 250 held-back sentences against Susanne bracketings
Black et al. (1993:7) use the crossing brackets
measure to define a notion of structural consis- 
tency, where the structural consistency rate for the 
grammar is defined as the proportion of sentences 
for which at least one analysis--from the many 
typically returned by the grammar--contains no 
crossing brackets, and report a rate of around 
95% for the IBM grammar tested on the com- 
puter manual corpus. However, a problem with 
the GEIG scheme and with structural consistency 
is that both are still weak measures (designed 
to avoid problems of parser/treebank represen- 
tational compatibility) which lead to unintuitive 
numbers whose significance still depends heavily 
on details of the relationship between the repre- 
sentations compared (e.g. between structure as- 
signed by a grammar and that in a treebank). One 
particular problem with the crossing bracket mea- 
sure is that a single attachment mistake embedded 
n levels deep (and perhaps completely innocuous, 
such as an "aside" delimited by dashes) can lead 
to n crossings being assigned, whereas incorrect 
identification of arguments and adjuncts can go 
unpunished in some cases. 
Schabes et al. (1993) and Magerman (1995) 
report results using the GEIG evaluation scheme 
which are numerically similar in terms of parse se- 
lection to those reported here, but achieve 100% 
coverage. However, their experiments are not 
strictly comparable because they both utilise more 
homogeneous and probably simpler corpora. (The 
appendix gives an indication of the diversity of 
the sentences in our corpus). In addition, Sch- 
abes et al. do not recover tree labelling, whilst 
Magerman has developed a parser designed to pro- 
duce identical analyses to those used in the Penn 
Treebank, removing the problem of spurious er- 
rors due to grammatical incompatibility. Both 
these approaches achieve better coverage by con- 
structing the grammar fully automatically, but as 
an inevitable side-effect the range of text phenom- 
ena that can be parsed becomes limited to those 
present in the training material, and being able to 
deal with new ones would entail further substan- 
tial treebanking efforts. 
To date, no robust parser has been shown 
to be practical and useful for some NLP task. 
However, it seems likely that, say, rule-to-rule se- 
mantic interpretation will be easier with hand- 
constructed grammars with an explicit, determi- 
nate rule-set. A more meaningful parser compar- 
ison would require application of different parsers 
to an identical and extended test suite and utilisa- 
tion of a more stringent standard evaluation pro- 
cedure sensitive to node labellings. 
TRAINING DATA SIZE AND ACCURACY
Statistical HMM-based part-of-speech taggers re- 
quire of the order of 100K words and upwards of 
training data (Weischedel et al., 1993:363); tag- 
gers inducing non-probabilistic rules (e.g. Brill, 
1994) require similar amounts (Gaizauskas, pc). 
Our probabilistic disambiguation system currently 
makes no use of lexical frequency information, 
training only on structural configurations. Nev- 
ertheless, the number of parameters in the prob- 
abilistic model is large: it is the total number of 
possible transitions in an LALR(1) table contain- 
ing over 150000 actions. It is therefore interesting 
to investigate whether the system requires more 
or less training data than a tagger. 
We therefore ran the same experiment as 
above, using GEIG to measure the accuracy of 
the system on the 250 held-back sentences, but 
varying the amount of training data with which 
the system was provided. We started at the full 
amount (3793 trees), and then successively halved 
it by selecting the appropriate number of trees at 
random. The results obtained are given in figure 1. 
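The successive-halving schedule can be sketched as below. Whether each smaller training set was drawn from the previous one (nested subsets, as assumed here) or independently from the full set is not stated in the text, and the function name, seed and rounding down of odd sizes are likewise assumptions.

```python
import random


def halving_subsets(trees, steps=6, seed=0):
    """Return the full training set followed by `steps` random halvings.

    Each subset is sampled without replacement from the previous one,
    so the subsets are nested (an assumption of this sketch).
    """
    rng = random.Random(seed)
    subsets = [list(trees)]
    for _ in range(steps):
        prev = subsets[-1]
        subsets.append(rng.sample(prev, len(prev) // 2))
    return subsets
```

Starting from 3793 trees, six halvings give training sets of 1896, 948, 474, 237, 118 and 59 trees, the last being one sixty-fourth of the full amount.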
The results show convincingly that the system 
is extremely robust when confronted with limited 
amounts of training data: when using a mere one 
sixty-fourth of the full amount (59 trees), accuracy 
was degraded by only 10-20%. However, there 
is a large decrease in accuracy with no training 
data (i.e. random choice). Conversely, accuracy is 
still improving at 3800 trees, with no sign of over- 
training, although it appears to be approaching an 
upper asymptote. To determine what this might
be, we ran the system on a set of 250 sentences
randomly extracted from the training corpus. On this
set, the system achieves a zero crossings rate of
60.0%, mean crossings 0.88, and recall and precision
of 77.0% and 75.2% respectively, with respect
to the original Susanne bracketings. Although this
is a different set of sentences, it is likely that the
upper asymptote for accuracy for the test corpus
lies in this region. Given that accuracy is increasing
only slowly and is relatively close to the asymptote
it is therefore unlikely that it would be worth
investing effort in increasing the size of the training
corpus at this stage in the development of the
system.
                                   Zero       Mean
                                   crossings  crossings  Recall  Precision
Probabilistic parser analyses
  Top-ranked analysis              67.2%      0.71       82.9%   83.9%
Table 4: GEIG evaluation metrics for test set of 250 held-back sentences against the manually-disambiguated
analyses
[Figure 1: plot of mean crossings, recall, precision and zero crossings against the fraction of the 3793 training sentences used: All, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, None]
Figure 1: GEIG metrics for held-back sentences, training on varying amounts of data
6. CONCLUSIONS 
In this paper we have outlined an approach to ro- 
bust domain-independent parsing, in which sub- 
categorisation constraints play no part, resulting 
in coverage that greatly improves upon more con- 
ventional grammar-based approaches to NL text 
analysis. We described an implemented system, 
and evaluated its performance along several dif- 
ferent dimensions. We assessed its coverage and 
that of previous versions on a development cor- 
pus and an unseen corpus, and demonstrated that 
the grammar refinement we have carried out has 
led to substantial improvements in coverage and 
reductions in spurious ambiguity. We also evalu- 
ated the accuracy of parse selection with respect 
to treebank analyses, and, by varying the amount 
of training material, we showed that it requires 
comparatively little data to achieve a good level 
of accuracy. 
We have made good progress in increasing 
grammar coverage, though we have now reached 
a point of diminishing returns. Further significant 
improvements in this area would require corpus- 
specific additions and tuning whose benefit would 
not necessarily carry over to other corpora. In the 
application we are currently using the system for-- 
automatic extraction of subcategorisation frames, 
and more generally argument structure, from large 
amounts of text (Briscoe & Carroll, 1996)--we do
not need full coverage; 70-80% appears to be suf- 
ficient. However, further improvements in cover- 
age will require some automated approach to rule 
induction driven by parse failure. Since our eval- 
uations indicate that our system achieves a good 
level of accuracy with little treebank data, and 
that 67-75% coverage was achieved for English 
quite early in the grammar refinement effort, port- 
ing the current system to other languages should 
be possible with small-to-medium-sized treebanks 
(around 20K words) and feasible manual effort 
(of the order of 12 person-months for grammar- 
writing and treebanking). This may yield a sys- 
tem accurate enough for some types of application, 
given that the system is not restricted to return- 
ing the single highest ranked analysis but can re- 
turn the n-highest ranked for further application- 
specific selection. 
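
As a minimal sketch of what such application-specific selection over the n highest ranked analyses might look like (the parse representation, the function names and the acceptance test below are illustrative assumptions, not part of the system described here):

```python
import heapq
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: (bracketed analysis, score), higher score = better.
Parse = Tuple[str, float]


def n_best(parses: List[Parse], n: int) -> List[Parse]:
    """Return the n highest-scoring parses, best first."""
    return heapq.nlargest(n, parses, key=lambda p: p[1])


def select(parses: List[Parse], n: int,
           accept: Callable[[str], bool]) -> Optional[Parse]:
    """Return the best-ranked of the top n parses that the application accepts.

    `accept` stands in for any application-specific test; None is returned
    if none of the top n analyses passes it.
    """
    for parse in n_best(parses, n):
        if accept(parse[0]):
            return parse
    return None


# Toy candidates with made-up scores.
candidates = [
    ("(S (NP a) (VP b))", -2.0),
    ("(S (NP a b))", -1.0),
    ("(S a b)", -3.5),
]
chosen = select(candidates, 2, lambda tree: "VP" in tree)
```

With n set to 1 this reduces to returning only the single highest ranked analysis, so a larger n trades extra application-side work for a chance to recover from a ranking error.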
Although we report promising results, parse 
selection that is sufficiently accurate for many 
practical applications will require a more lexi- 
calised system. Magerman's (1995) parser is an 
extension of the history-based parsing approach 
developed at IBM (Black et al., 1993) in which 
rules are conditioned on lexical and other (es- 
sentially arbitrary) information available in the 
parse history. In future work, we intend to ex- 
plore a more restricted and semantically-driven 
version of this approach in which, firstly, probabili- 
ties are associated with different subcategorisation 
possibilities, and secondly, alternative predicate- 
argument structures derived from the grammar 
are ranked probabilistically. However, the mas- 
sively increased coverage obtained here by relaxing 
subcategorisation constraints underlines the need 
to acquire accurate and complete subcategorisa- 
tion frames in a corpus-driven fashion, before such 
constraints can be exploited robustly and effec- 
tively with free text. 

REFERENCES 
Alshawi, H., Carter, D., Crouch, R., Pulman, S., 
Rayner, M. & Smith, A. 1992. CLARE: a contextual 
reasoning and cooperative response framework 
for the Core Language Engine. SRI International, 
Cambridge, UK. 
Billot, S. & Lang, B. 1989. The structure of shared 
forests in ambiguous parsing. In Proceedings of 
the 27th Meeting of Association for Computational 
Linguistics, Vancouver, Canada. 143-151. 
Black, E., Garside, R. & Leech, G. (eds.) 1993. 
Statistically-driven computer grammars of English: 
the IBM/Lancaster approach. Amsterdam, 
The Netherlands: Rodopi. 
Brill, E. 1994. Some advances in transformation- 
based part of speech tagging. In Proceedings of the 
12th National Conference on Artificial Intelligence 
(AAAI-94), Seattle, WA. 
Briscoe, E. 1994. Prospects for practical parsing of 
unrestricted text: robust statistical parsing techniques. 
In Oostdijk, N. & de Haan, P. eds. Corpus-based 
Research into Language. Rodopi, Amsterdam: 
97-120. 
Briscoe, E. & Carroll, J. 1993. Generalised prob- 
abilistic LR parsing for unification-based gram- 
mars. Computational Linguistics 19.1: 25-60. 
Briscoe, E. & Carroll, J. 1994. Parsing (with) 
punctuation etc. Rank Xerox Research Centre, 
Grenoble, MLTT-TR-007. 
Briscoe, E. & Carroll, J. 1995. Developing and 
evaluating a probabilistic LR parser of part-of- 
speech and punctuation labels. In Proceedings of 
the 4th ACL/SIGPARSE International Workshop 
on Parsing Technologies, Prague, Czech Republic. 
48-58. 
Briscoe, E. & Carroll, J. 1996. Automatic extraction 
of subcategorization from corpora. Under review. 
Briscoe, E., Grover, C., Boguraev, B. & Carroll, J. 
1987. A formalism and environment for the development 
of a large grammar of English. In Proceedings 
of the 10th International Joint Conference on 
Artificial Intelligence, Milan, Italy. 703-708. 
Carroll, J. 1993. Practical unification-based pars- 
ing of natural language. Cambridge University, 
Computer Laboratory, TR-314. 
Carroll, J. 1994. Relating complexity to prac- 
tical performance in parsing with wide-coverage 
unification grammars. In Proceedings of the 32nd 
Meeting of Association for Computational Lin- 
guistics, Las Cruces, NM. 287-294. 
Carroll, J. & Grover, C. 1989. The derivation 
of a large computational lexicon for English from 
LDOCE. In Boguraev, B. & Briscoe, E. eds. Computational 
Lexicography for Natural Language Processing. 
Longman, London: 117-134. 
Church, K. 1988. A stochastic parts program and 
noun phrase parser for unrestricted text. In Pro- 
ceedings of the 2nd Conference on Applied Natural 
Language Processing, Austin, Texas. 136-143. 
Elworthy, D. 1993. Part-of-speech tagging and 
phrasal tagging. Acquilex-II Working Paper 10, 
Cambridge University Computer Laboratory (can 
be obtained from cide@cup.cam.ac.uk). 
Elworthy, D. 1994. Does Baum-Welch re-estimation 
help taggers? In Proceedings of the 4th 
Conference on Applied NLP, Stuttgart, Germany. 
Garside, R., Leech, G. & Sampson, G. 1987. Computational 
analysis of English. Harlow, UK: Longman. 
Grishman, R., Macleod, C. & Meyers, A. 1994. 
Comlex syntax: building a computational lexicon. 
In Proceedings of the International Conference on 
Computational Linguistics, COLING-94, Kyoto, 
Japan. 268-272. 
Harrison, P., Abney, S., Black, E., Flickenger, 
D., Gdaniec, C., Grishman, R., Hindle, D., In- 
gria, B., Marcus, M., Santorini, B. & Strza- 
lkowski, T. 1991. Evaluating syntax performance 
of parser/grammars of English. In Proceedings 
of the Workshop on Evaluating Natural Language 
Processing Systems, ACL. 
Hindle, D. 1989. Acquiring disambiguation rules 
from text. In Proceedings of the 27th Annual Meet- 
ing of the Association for Computational Linguis- 
tics, Vancouver, Canada. 118-25. 
Jackendoff, R. 1977. X-bar syntax. Cambridge, 
MA: MIT Press. 
Jones, B. 1994. Can punctuation help parsing? 
In Proceedings of the International Conference on 
Computational Linguistics, COLING-94, Kyoto, 
Japan. 
Magerman, D. 1995. Statistical decision-tree mod- 
els for parsing. In Proceedings of the 33rd Annual 
Meeting of the Association for Computational Lin- 
guistics, Boston, MA. 
de Marcken, C. 1990. Parsing the LOB corpus. 
In Proceedings of the 28th Annual Meeting of the 
Association for Computational Linguistics, New 
York. 243-251. 
Nunberg, G. 1990. The linguistics of punctuation. 
CSLI Lecture Notes 18, Stanford, CA. 
Pereira, F. & Warren, D. 1980. Definite clause 
grammars for language analysis - a survey of the 
formalism and a comparison with augmented tran- 
sition networks. Artificial Intelligence 13.3: 231- 
278. 
Sampson, G. 1995. English for the computer. Ox- 
ford, UK: Oxford University Press. 
Schabes, Y., Roth, M. & Osborne, R. 1993. Pars- 
ing of the Wall Street Journal with the inside- 
outside algorithm. In Proceedings of the Meeting 
of European Association for Computational Lin- 
guistics, Utrecht, The Netherlands. 
Taylor, L., Grover, C. & Briscoe, E. 1989. The 
syntactic regularity of English noun phrases. In 
Proceedings of the 4th European Meeting of the As- 
sociation for Computational Linguistics, Manch- 
ester, UK. 256-263. 
Taylor, L. & Knowles, G. 1988. Manual of information 
to accompany the SEC corpus: the 
machine-readable corpus of spoken English. Uni- 
versity of Lancaster, UK, Ms. 
Weischedel, R., Meteer, M., Schwartz, R., 
Ramshaw, L. & Palmucci, J. 1993. Coping with 
ambiguity and unknown words through probabilistic 
models. Computational Linguistics 19.2: 359-382. 
